PREFACE

In April of 1959 a Symposium on Information and Decision Processes was held at 
Purdue University. The conference lasted two and one half days, each of the ten 
speakers being allotted half of a morning or afternoon. This book is an outgrowth of 
that conference, in that each speaker has contributed a chapter; some of these chap- 
ters are identical to the conference presentations, while in other cases the authors 
have chosen to modify their papers, or to submit altogether different papers, more 
suitable for a book of this nature. In addition, two of the outstanding papers from the 
1958 Purdue Conference on the same subject are included. 

I wish to take this opportunity to thank the authors for their speedy action (in most 
cases) in submitting their manuscripts, and for their cooperative attitude (in all 
cases) toward the wearying trivialities of my editorship. I also wish to acknowledge 
the outstanding co-chairmanship of Professors Paul Randolph and Judah Rosenblatt
who arranged and managed the conference. The typists, Loree Lenta and Lisa 
Rosenblatt, deserve thanks for the care, above and beyond the call of duty, which 
they lavished on the manuscript. And my wife, Florence, has my special appreciation 
for encouraging me to take on this task although it meant much extra time alone for 
her. 

Robert E. Machol 
February, 1960

INTRODUCTION

In April of 1959, ten of this country's leading scholars forgathered on the campus of 
Purdue University to discuss the nature of information and the nature of decision. 
Their contributions toward these questions appear in this book. What thread ties to- 
gether these contributions? What interests do these men have in common? For what 
reason did they travel collectively so many thousands of miles to come to this con- 
ference? 

To answer these questions it is necessary to view the changing aspect of the scien- 
tific approach to epistemology, and the striking progress which has been wrought in 
the very recent past. The decade from 1940 to 1950 witnessed the operation of the 
first stored-program digital computer. The concept of information was quantified,
and mathematical theories were developed for communication (Shannon) and decision 
(Wald). Known mathematical techniques were applied to new and important fields, as 
the techniques of complex-variable theory to the analysis of feedback systems and the
techniques of matrix theory to the analysis of systems under multiple linear
constraints. The word "cybernetics" was coined, and with it came the realization of the
many analogies between control and communication in men and in automata. New 
terms like "operations research" and "system engineering" were introduced; despite 
their occasional use by charlatans, they have signified enormous progress in the
solution of exceedingly complex problems, through the application of quantitativeness and
objectivity. 

At this time it is difficult to put one's finger on any single contribution in the decade 
1950-1960 which is comparable to those above, and yet progress has probably been
even greater. From the point of view of an educator, one cannot overlook the wide 
distribution which has been given to these ideas. There has been remarkable progress 
from analysis to synthesis, always a sign of maturity in a field of analytic endeavour. 
There has been consolidation, for example in the establishment of a more rigorous 
basis for information theory; there has been unification, for example in the demon- 
stration of the formal similarity between game theory and linear programming; there 
has been application to mathematically more difficult situations, for example nonlinear
servo systems and information channels with memory; there has been implementation, 
as in commercially available computers which by any reasonable measure are hun- 
dreds of times more powerful than the primitive devices of 1950; there has been de- 
limitation of the boundaries of many of these fields. 

Yet more significant than any of the above is the enlargement of viewpoint which
has taken place. It is difficult to change one's frame of reference, and so one does 
not realize how great are the changes which have taken place in this decade. There 
has been a startling enlargement of the areas which we are now willing to discuss 
scientifically, to include subjects which would have been rejected a few years ago as 
sheer metaphysics. Equally startling is the enlargement in predictions: reputable
computer engineers now confidently predict achievements for their machines which
would have been defined a few years ago as "thinking", and therefore by definition
unattainable by machines. 

In fact it is these very questions within these new areas and subject to these new
predictions of achievement which are the subject matter of the present book. Be- 
cause we have discovered in this past decade that thinking, and decision, are not sole- 
ly the province of the metaphysicist, but are appropriate subjects for scientific in- 
quiry. They have been rendered so, to a considerable extent, by the very authors of 
this volume. It is no coincidence that of these twelve authors, nine are professional 
mathematicians, and the other three (an economist, a philosopher, and an electrical 
engineer) are well known as competent mathematicians. For it is primarily by means
of mathematical techniques that these subjects have been brought to the stage where 
objective, quantitative, scientific methodology can be applied to them. Out of that 
methodology will inexorably come the applications and the understanding which are the 
twin goals sought, one or the other, by all the many scholars, be they engineers or 
scientists, who are working so assiduously in this field. 

Wiener has wisely reviewed the old mechanist-vitalist controversy in this new
context, and asserted that it had been relegated to the limbo of badly posed questions. But
it would seem that we have revived this controversy. And we have revived it with a 
bald mechanist affirmation which would have brought us all to the pyre a few centuries 
back. We assert that it is possible to describe analytically any human function which 
can be reasonably defined in objective terms and we specifically include in such
functions "thinking" insofar as that term is definable. If by "thinking" one means be- 
ing able to do arithmetic, or play a good game of chess, or learn from experience, or 
make optimal decisions in exceedingly complex situations, then we assert that thinking 
can be described analytically. And there are two important corollaries: if it can be
described analytically, it can be simulated; and if it can be simulated, it can be per- 
formed mechanically. 

Two caveats are needed. Of course we cannot do all these things today-- but we 
assert today that these things are achievable in the foreseeable future, and we are 
bent on approaching them. As the space engineer today is confident that he can and 
will put a living man on the surface of Mars within a few years, although he cannot 
say just how or when, so are we confident that we shall build chess-playing machines, 
and decision-making machines, and near-optimal communication systems. And this 
is an advance; such an assertion would have been dismissed as wild and unscientific 
a few years ago, as would a prediction of imminent interplanetary travel. 

The second caveat concerns delimitation of our field of competence, and therefore 
of our field of interest. We are not concerned with those functions which are charac- 
teristic of emotion rather than thought. He would be a fool who would assert today 
that love is purely a matter of biochemistry although he would also be a fool who 
would assert that biochemistry has nothing to do with it. 

An example is my own research interest which centers around the construction of 
mathematical models of the teaching process. I have not thus far achieved significant 
results, but I am convinced not only that I can do so, but that it will be easy. From 
models of this sort it is then hoped that one can go on to the far more difficult, but far 
more useful, task of constructing mathematical models of the learning process. Such 
a research project is made possible in part by modern tools (notably the large digital 
computer), but also in part by viewpoint. It is a truism to say that a proper statement
of the problem is a long step towards its solution. What we have done in the past de- 
cade is to restate our basic problem, and to state it in a form which makes a solution 
look feasible. 

This basic problem is that of building a mathematical model of thought processes, 
and in particular of those aspects of thought which are concerned with information and
decision processes. The perceptron is one type of model -- a set of memory devices
connected in random fashion -- which has not yet achieved useful results but certainly
seems to be a promising approach. The self-adaptive feedback control system -- which
goes beyond the normal servo function of controlling its output, and in addition
controls the parameters by which it controls its output -- is another, which has already
achieved pragmatic results in equipment control. It may be that the question of
self-adaptation is a key to the whole question of how the human functions in a
decision-making situation. For in many cases the ability of the human mind to adapt itself
to a changing and complex environment is beyond our present aims in model construction.

On the other hand, we have many models which are already an improvement on 
human capabilities. We assert apodictically that in a transportation situation which is 
describable by the familiar linear model, a mathematician with a computer can always 
do at least as well as, and usually better than, an experienced man operating purely on
intuition. Such models lead to brute-force solutions of many interesting problems, 
which the human brain does not solve by brute-force methods. As George Brown 
warns us in an article in this book, we must not place our reliance exclusively on 
brute-force methods, but must also attempt to simulate the gestalt approach which the 
human uses, and which he uses on so many problems with a not inconsiderable degree 
of success. 

Consider the diagram shown here. If one were asked to 
pick the point corresponding to the center of gravity, he 
would have little difficulty in choosing a point which was 
approximately correct, and in being reasonably confident 
that this point was, in fact, approximately correct. On the other hand, he might have 
considerable difficulty explaining just how he had arrived at his choice, or why he was 
confident that it was close to the correct center of gravity. At the present state of
advance, it would not be possible to program a computer to take the same gestalt
approach. One could, of course, program the computer to solve the problem by brute
force -- amounting, in essence, to a curve-tracing procedure followed by numerical
integration -- and having done so it would be possible to solve the problem rapidly and
reliably and with a degree of accuracy which it is quite beyond the unaided human to
duplicate. (The repeated emphasis on computers arises from the experience that one
cannot program a problem for a computer until he has stated his criteria and
methodology completely, explicitly, and objectively.)
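
As a minimal sketch of the brute-force route just described, suppose the curve-tracing step has already reduced the outline of the figure to a closed polygon of boundary points. The diagram itself is not reproduced here, so the L-shaped outline below and the function name `polygon_centroid` are purely illustrative assumptions. The numerical integration then collapses to a sum over the edges (the standard polygon-centroid formula for a uniform lamina).

```python
def polygon_centroid(vertices):
    """Centroid of a uniform lamina bounded by a simple closed polygon.

    `vertices` is a list of (x, y) points traced once around the outline.
    """
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]   # wrap around to close the outline
        cross = x0 * y1 - x1 * y0        # signed contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("outline encloses no area")
    # Centroid = (1 / (6 * area)) * sum; with area = twice_area / 2 this is 1 / (3 * twice_area).
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Hypothetical traced outline: an L-shaped figure (not the figure of the text).
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
centroid = polygon_centroid(l_shape)
print(centroid)
```

The point of the sketch is the one made in the text: every step had to be stated explicitly before the machine could do anything, whereas the human estimate needs no such statement.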

Sometimes it is more difficult to formulate the criterion for a problem than to state 
the question itself. Consider the following problem, which is a good deal simpler and 
more explicit than most. One has a sample of size n, drawn from a population known 
to be normal, and one wishes to estimate both parameters. In the case of the mean, it
is clear that by any reasonable criterion the best estimator is $\bar{x} = \sum x_i / n$. But in the
case of the variance there is no such unique and simple answer. $\sum (x_i - \bar{x})^2 / (n - 1)$,
the formula taught in the usual cookbook course on statistics, is "best" in the sense of
unbiasedness; $\sum (x_i - \bar{x})^2 / n$ is "best" in the maximum-likelihood sense; and
$\sum (x_i - \bar{x})^2 / (n + 1)$ is "best" in the sense of minimum expected value of mean-square
deviation from the true parameter. How much more difficult, then, to choose an
optimum information channel, or make an optimum decision in inventory handling or air
traffic control system design.
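
The three claims can be checked exactly rather than by sampling. The sketch below relies on a standard fact not stated in the text: for a normal sample of size n, the sum of squares $S = \sum (x_i - \bar{x})^2$ has mean $(n-1)\sigma^2$ and variance $2(n-1)\sigma^4$. The function names are illustrative only. Bias and mean-square error of the estimator S/c are reported in units of $\sigma^2$ and $\sigma^4$ respectively.

```python
from fractions import Fraction


def relative_bias(n, divisor):
    """Bias of S/divisor in units of sigma^2, using E[S] = (n - 1) * sigma^2."""
    return Fraction(n - 1, divisor) - 1


def relative_mse(n, divisor):
    """Mean-square error of S/divisor in units of sigma^4.

    MSE = variance + bias^2, with Var[S] = 2 * (n - 1) * sigma^4.
    """
    variance = Fraction(2 * (n - 1), divisor ** 2)
    return variance + relative_bias(n, divisor) ** 2


n = 10
for label, divisor in [("n - 1", n - 1), ("n", n), ("n + 1", n + 1)]:
    print(label, relative_bias(n, divisor), relative_mse(n, divisor))

# Search all integer divisors in a range for the one with the smallest MSE.
best_divisor = min(range(1, 3 * n), key=lambda c: relative_mse(n, c))
print("divisor with smallest mean-square error:", best_divisor)
```

Only the divisor n - 1 gives zero bias, while the smallest mean-square error is reached at n + 1; the divisor n lies between the two on both counts, its claim to be "best" resting on the separate maximum-likelihood criterion.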

It is clear that many different approaches must be carried on simultaneously. This 
book is a sampling of the approaches being taken by some of the most brilliant men 
working in this field -- men who have already made notable advances and are
exceedingly likely to make more. They vary from the heuristic approach of Flood to the
formally stated theorems of Wolfowitz; from the meticulous caution of Doob to the
provocative conjectures of Sobel; from some which are in my opinion exceedingly
important, such as Shannon's, to others which may prove to be trivial. But they are all
exciting. For this field of endeavour is the most exciting which any scholar can pur- 
sue-- the study of information and decision processes. 
COMPUTATION IN DECISION MAKING
George W. Brown 

The context with which this paper is concerned is one in which solutions to real-life
decision problems are sought through application of mathematics. The use of the word 
"computation" in the title implies for us an interest in the process of obtaining usable 
results, to be applied to situations arising in the "real" world. In contrast, the
computational process employed experimentally in support of research in Decision Theory
itself will not be discussed here. As a consequence of its orientation this paper will 
necessarily consist mainly of philosophical reflections on the nature, state, and pros- 
pects of the difficult art of making mathematical applications to the making of deci- 
sions. 

Recent years have witnessed an impressive growth of results, prestige, and future 
expectations associated with what has been termed the "quantitative", "objective", or
"scientific" approach to problems of the real world. The boom is descended from the 
contacts between the mathematician's world and the real world, established during 
World War II on a scale hitherto unprecedented and since propagated in all directions. 
Accompanying the boom and its tremendous achievements is a whole new vocabulary 
of "O.K." words (which need no specific citation here) and, unfortunately, some
conceptions which, at the very least, constitute massive oversimplifications of the nature
and circumstances of the art with which we are here concerned. Recent and probable 
future developments in electronic computational and data processing devices, and the 
spectacular achievements associated with their applications, whet the collective appe- 
tite still further. Thus we are led to the simple picture of a typical adventure in deci- 
sion-making: the problem is first formulated appropriately through construction of a 
mathematical model, next analyzed to obtain a method of solution, and finally solved, 
using a sufficiently powerful computer (if required). In the most simple-minded view 
of the situation the very mention of the term "mathematical model" is almost of itself 
enough to vanquish all difficulties of formulation, and optimism is almost universally 
warranted in expecting methods of solution to appear, together with machines fast
enough to produce the solutions. 

This paper will dip into some of the links similar to those by which the real world is 
connected for us to mathematics, with particular reference to decision making, in an 
attempt to gain insight into some pertinent limitations, prospects, and probable future 
investigations. There will be explorations of the probable failure, often, of the most
obvious and straightforward techniques to provide solutions within inherent machine 
capabilities, discussion of conceptual inadequacies and other modeling difficulties, 
consideration of alternative approaches involving computers in "simulation" and
"gaming" contexts, and some attention to the possible uses of computers in "imaginative"
processes comprehending "heuristic programming", "artificial intelligence", or 
"learning" programs. The position to be supported is that present expectations asso- 
ciated with the boom described above rest excessively upon the past record of more 
easily obtained results (which naturally arise in the early skimming of a new field of 
problems); that, nevertheless, ultimate fulfillment and surpassing of present expecta- 
tions will certainly occur; but that they will occur as a result of developments still to 
come and not primarily in accord with the simple-minded view stated earlier. 

As a preliminary, let us consider briefly how computers may enter into the solution 
of a decision problem. It has been suggested, by a wag who shall remain anonymous, 
that computers solve decision problems by having the answers given to them.
Properly interpreted, this suggestion is nearly correct. In some decision problems, there
does exist a well-formulated model and a practical algorithm for numerical computa- 
tion, corresponding to the simple-minded view of decision making which was stated 
earlier. In very many problems, however, it may be the case that an algorithm is 
unknown, or known but impractical, or that insufficient information is available, or 
that the simple picture fails to be adequate for any one of a dozen other reasons. In 
these cases, if the computer is to yield a decision there must exist some reasonable 
rule of thumb, as adopted, for example, in certain classes of scheduling problems, or 
some set of approximations which lead to acceptable decision-making behavior. Com- 
putational approaches may be used to develop such approximations, or to aid the intu- 
ition of the ultimate decision-maker in a number of ways. New applications will cer- 
tainly continue to join the first class mentioned, that is, the well-behaved class; but 
it seems likely that most such new applications will require ever increasing efforts to
place them there. For this reason it will be increasingly important to develop satis- 
factory techniques for dealing with problems in the second class. 

To illustrate a situation in which an algorithm exists, but so far in unmanageable 
form, consider the game of chess, which should be relatively simple, compared to a 
great many situations in the real world. According to the theory of games, there must 
exist for each player a solution in pure strategies (requiring no random mixing), guar- 
anteeing both Black and White the best outcome that can be guaranteed. This result 
follows from the fact that chess is a member of a class of games characterized as be- 
ing games of complete information. There exists an algorithm, which in principle 
works for all games of this type, and which simply requires isolation of a pair of 
strategies (one for each player) with the following properties: that Black's chosen 
strategy does at least as well as any possible Black strategy against the chosen White 
strategy; and that White's chosen strategy does at least as well as any possible White 
strategy against the chosen Black strategy. With luck one might establish such a pair 
by computing the outcomes of a number of strategy pairs equal to one less than the to- 
tal number of all possible White and all possible Black strategies. The only difficulty 
is that the number of strategies conceivably open to each player is astronomical, with 
the result that the solution is simply unattainable by straight enumeration techniques, 
following the bare prescription of game theory. Of course the possibility exists that 
practically all non-optimal strategies might be susceptible to being ruled out by some 
analytical procedure, thus reducing the problem to manageable proportions. So far, 
this has not happened, and if it should, it would in any case remove the example from 
the class of straightforward, simple-minded applications of theory corresponding to a 
large general class of problems. 
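
The defining property of such a pair of strategies can be stated as a short program. The sketch below assumes, purely for illustration, that the game has already been reduced to a table `payoff[i][j]` giving the outcome to White when White uses strategy i and Black uses strategy j; the function name and the small example tables are hypothetical. For chess no such table could ever be written down, which is exactly the difficulty noted above.

```python
def find_saddle_point(payoff):
    """Return (i, j, value) for a pair of pure strategies in equilibrium, or None.

    payoff[i][j] is the outcome to White (who maximizes) when White plays
    strategy i and Black (who minimizes) plays strategy j.
    """
    for i, row in enumerate(payoff):
        for j, value in enumerate(row):
            # White's strategy i does at least as well as any other against Black's j.
            white_cannot_improve = all(other_row[j] <= value for other_row in payoff)
            # Black's strategy j does at least as well as any other against White's i.
            black_cannot_improve = all(entry >= value for entry in row)
            if white_cannot_improve and black_cannot_improve:
                return i, j, value
    return None


# A small illustrative table with a solution in pure strategies.
example = [
    [3, 1, 4],
    [2, 0, 1],
    [5, 2, 6],
]
print(find_saddle_point(example))

# Matching pennies: no solution in pure strategies exists.
print(find_saddle_point([[1, -1], [-1, 1]]))
```

This naive search examines every entry of the table; it is meant only to make the two conditions concrete, not to suggest a practical method.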

To show further how the number of strategies may mount up in a simple game, let 
us analyze briefly a simple card game which, on the face of it, ought to be
considerably simpler than chess. The game in question, called Gops or Goofspiel [1], has been
selected because the strategies are somewhat easier to count than in chess. In Goof- 
spiel each of two players starts with a complete suit of 13 cards; a third complete suit 
is shuffled and placed face down; the fourth suit is discarded. Play begins by facing 
up the top card of the table pack, after which each player chooses independently a card 
from his own hand. Ranking the cards in ascending order from Ace to King, the
player who has played higher captures the exposed table card, together with a number of
points corresponding to its rank (1 through 13). Play continues in this manner, with a 
new card turned up each time on the table, and with the remaining cards in the players' 
hands diminishing by one each time, until all cards are exhausted after 13 plays. In 
case of a tie at any time the next card is also turned up, and all table cards ride until 
the tie is broken. The winner of the game, of course, is the player who has captured 
the most points. 
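
The rules just given are simple enough to state as a program. The sketch below is one reading of them, under stated assumptions: cards are represented by their ranks 1 through 13, each player is a function of the exposed card and the two hands (a hypothetical interface, much narrower than a full strategy, which may depend on the whole history), and table cards still riding after a tie on the final play are awarded to no one.

```python
def play_goofspiel(table_order, choose_a, choose_b):
    """Play one game and return (points_a, points_b).

    table_order: the 13 table cards (ranks 1..13) in the order they are exposed.
    choose_a, choose_b: functions (table_card, own_hand, opponent_hand) -> card played.
    """
    hand_a = set(range(1, 14))
    hand_b = set(range(1, 14))
    points_a = points_b = 0
    riding = 0  # value of table cards not yet captured (accumulates after ties)
    for table_card in table_order:
        riding += table_card
        card_a = choose_a(table_card, frozenset(hand_a), frozenset(hand_b))
        card_b = choose_b(table_card, frozenset(hand_b), frozenset(hand_a))
        if card_a not in hand_a or card_b not in hand_b:
            raise ValueError("a player tried to play a card he no longer holds")
        hand_a.remove(card_a)
        hand_b.remove(card_b)
        if card_a > card_b:
            points_a, riding = points_a + riding, 0
        elif card_b > card_a:
            points_b, riding = points_b + riding, 0
        # On a tie nothing is captured; the table cards ride to the next play.
    return points_a, points_b


def match_table_card(table_card, own_hand, opponent_hand):
    """Illustrative player: always bid the same rank as the exposed card."""
    return table_card


def play_highest(table_card, own_hand, opponent_hand):
    """Illustrative player: always throw the highest card still held."""
    return max(own_hand)


result = play_goofspiel(range(1, 14), match_table_card, play_highest)
print(result)  # (70, 21)
```

In this particular game the seventh play is a tie, so the 7 rides and is captured together with the 8 on the next play.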

Goofspiel is sufficiently interesting so that players might as well relieve themselves 
of the strain of remembering past plays, by the process of leaving exposed cards which 
have been played. Intuitively, the general idea is to "buy" the table cards at as low a
"price" as possible, with a strong preference for beating the opponent on individual
turns by very little, while being beaten by large margins oneself. (Hence the name of 
the game. If I play the queen and my opponent plays the jack, he has "goofed"; but if 
I play the queen and he plays either the king or the deuce, I have "goofed".)

It is easily shown that this game does not possess solutions in pure strategies. For 
if there exists a pure strategy for the first player, the second player can know that 
strategy, and can arrange to play king on queen, queen on jack, etc., winning twelve
of the thirteen cards (and losing the thirteenth by playing ace on king). Nevertheless, 
the theory of games assures us that there are solutions in mixed strategies, and that 
a number of finite processes exist for determining a solution. Our purpose here is not
to estimate the total difficulty of solving the game completely, but rather to discuss
the colossal number of pure strategies open to each player and to demonstrate the
impossibility even of listing them one by one, much less going on to complete the solution
by direct methods. 
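
The argument against pure strategies can be made concrete. If the first player's strategy is known, the card he will play at each turn is known in advance, and the second player need only answer each card with the next higher one, throwing the ace on the king. The sketch below assumes the first player's sequence of plays is given as a list of ranks 1 through 13; the function names are illustrative.

```python
def counter_plays(first_player_plays):
    """Answer each known card with the next higher rank, and the king (13) with the ace (1)."""
    return [1 if card == 13 else card + 1 for card in first_player_plays]


def tally(table_order, plays_a, plays_b):
    """Points captured by each player; no ties arise against counter_plays."""
    points_a = points_b = 0
    for table_card, card_a, card_b in zip(table_order, plays_a, plays_b):
        if card_a > card_b:
            points_a += table_card
        else:
            points_b += table_card
    return points_a, points_b


table_order = list(range(1, 14))
known_plays = list(table_order)        # first player bids the rank of the exposed card
replies = counter_plays(known_plays)   # second player's answers, still a full suit
rounds_won = sum(b > a for a, b in zip(known_plays, replies))
print(rounds_won, tally(table_order, known_plays, replies))  # 12 (13, 78)
```

The replies use each of the thirteen ranks exactly once, so the second player can always make them from his own hand.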

Before attempting to count the strategies we might review briefly the definition of 
the term, as used in the theory of games. A strategy is a complete specification,
providing a unique play for each situation which might conceivably arise during the entire
game, and existing in advance of any actual play. Thus a single strategy must
anticipate all combinations of one's own moves with all moves of the opponent and with every
possible order of the table pack. Equivalently, a strategy is any set of rules which you
could give to a small boy to enable him to play every game exactly as you would have
played the same game.

Taking the simplest method of enumerating possible strategies, we note that on the 
first play we need to allow for 13 different cards which might be exposed, and we 
might specify any one of 13 cards to be played. A little thought shows that a single 
strategy, specifying the first move, consists of a table with an arbitrary entry oppo- 
site each possible exposed card, as in the following example: 

Exposed   To be played
A         A
2         A
3         7
4         5
5         5
6         2
7         Q
8         A
9         9
10        K
J         K
Q         K
K         K

Since each of 13 positions in the right-hand column may be filled in any one of 13
different ways, there are $13^{13}$ different tables of this sort, each one a possible candidate
for specifying the first move. Thus, without getting beyond the first move we observe
that there are $13^{13}$ possible ways of starting, each one of which corresponds to all the
different strategies which agree on the first move. Suppose now that any particular
one of these $13^{13}$ starting points had been selected, and consider the situation at the
second move. The opponent might have played any one of 13 different cards on the
first play, and any one of 12 cards may now have been turned up on the table;
specification of a strategy requires that every one of these $13 \cdot 12 = 156$ possibilities be
anticipated, following each of the 13 possibilities which was already anticipated by the
specifications of the opening move. Unfortunately even at the second move we cannot
ignore the initial 13 possibilities for the exposed table card on the first play, since our
strategy must now tabulate the moves to be made for every conceivable situation which
might arise. We obtain $13 \cdot 156 = 2028$ situations for which the second move must be
prepared. The original specification of the first move permits description of the
complete state of the game to date, for each of these 2028 situations, any one of which can
occur. Noting now that for each of these our player has 12 cards from which to choose,
we see that a specification for the second move, having chosen a specification for the
first move, entails preparation of a table with 2028 entries, each of which may take on
any one of 12 values. This implies that there are $12^{2028}$ possible specifications of the
second move, for each possible specification of the first move, leading to the
ridiculous number $13^{13} \cdot 12^{2028}$ for the number of different ways of specifying the first two
moves. Without proceeding further along this road it is clear that it might pay to
inspect the problem to see if some obvious simplification exists.
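
The size of this number is easily checked with exact integer arithmetic for the small factors and logarithms for the large one. The following is a minimal sketch; the variable names are illustrative.

```python
import math

first_move_tables = 13 ** 13             # one of 13 replies for each of 13 exposed cards
second_move_situations = 13 * (13 * 12)  # first exposed card x opponent's card x second exposed card
# Each of the 2028 situations may be answered with any one of 12 cards, giving 12**2028 tables.
log10_second_move_tables = second_move_situations * math.log10(12)

log10_first_two_moves = math.log10(first_move_tables) + log10_second_move_tables
decimal_digits = math.floor(log10_first_two_moves) + 1

print(first_move_tables)        # 302875106592253
print(second_move_situations)   # 2028
print(decimal_digits)           # number of decimal digits in 13**13 * 12**2028
```

A number of more than two thousand decimal digits, reached after only two of the thirteen moves, is what the text means by "ridiculous".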

For the sake of the curious we shall make such a simplification and investigate the 
consequences. Note that the original game is solvable by providing solutions indepen- 
dently to each of 13 different games, corresponding to the first card exposed in the 
table deck, which, after all, is known to both players before either makes any actual 
play. Since a strategy for the overall game would correspond to any combination of 
strategies, one for each of the 13 particular games, and since each of the latter has
the same total number of possible strategies, it may be seen that the number of
strategies of the overall game will be the 13th power of the number of strategies of any
one of the particular games. Offhand it appears easier to solve 13 such games sepa- 
rately than to solve the original game at once. 

Thus we may start again to enumerate strategies, assuming that the first card of 
the table pack is fixed and known. For the first move we find only 13 possibilities for 
our player and for the second move we have only the $13 \cdot 12 = 156$ different situations
corresponding to 13 choices of the other player on the first move, and 12 possibilities
for the second table card to be exposed. Again our player can choose one out of 12
cards for each of the 156 situations, with the result that there are now only $13 \cdot 12^{156}$
ways of specifying the first two moves. Unfortunately each of the 156 situations
becomes, on the third move, $12 \cdot 11 = 132$ new situations, corresponding to the 12
selections the other player might have made on the second move and to the 11 new
possibilities for the third card to be exposed in the table pack. Since now our player must
choose from 11 different possibilities in his own hand it appears that the third move
brings in the juicy factor $11^{156 \cdot 132}$, and again it's time to stop. We already have
$13 \cdot 12^{156} \cdot 11^{156 \cdot 132}$ different ways to specify how to get through three moves of the
game, even confining our attention to a particular top card in the table pack. It will be
near enough for our purposes if we substitute $10^{20{,}000}$ for the number just obtained,
and observe that the number of possible strategies for the entire game must be a great
deal larger yet.
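
The rounding to a power of ten can be verified by adding logarithms, since the number itself is too long to be worth printing. A minimal sketch, with illustrative variable names:

```python
import math

situations_move_2 = 13 * 12                      # 156
situations_move_3 = situations_move_2 * 12 * 11  # 156 * 132

# log10 of 13 * 12**156 * 11**(156 * 132), the count for the first three moves.
log10_three_moves = (
    math.log10(13)
    + situations_move_2 * math.log10(12)
    + situations_move_3 * math.log10(11)
)

print(situations_move_3)   # 20592
print(log10_three_moves)   # a little under 21614
```

The count for three moves therefore exceeds the round figure adopted in the text by some sixteen hundred orders of magnitude, so the substitution errs on the side of caution.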

For the sake of perspective assume that strategies could be disposed of at one per
microsecond, then note that there are about $3 \times 10^{13}$ microseconds in a year, or about
$10^{16}$ microseconds in 300 years. There is obviously no hope of denting $10^{20{,}000}$ in
units of $10^{16}$ at a time. Expected progress in computer design along the lines already
established might yield, say, as much as two orders of magnitude improvement in
speed every five years, so that by a wild extrapolation we might look for 40 orders of
magnitude improvement in 100 years, and in 300 years 120 orders of magnitude.
Obviously, the figure $10^{20{,}000}$ is just as far away as ever. It is almost unbelievable that
such an innocent-appearing little game could generate this kind of combinatorial
madness. And surely this is a simple game compared to many real-life decision problems.
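
The arithmetic of this paragraph can be retraced in a few lines. The sketch below is deliberately generous to the machine: it assumes, beyond what the text says, that the speed reached after 300 years of improvement is available for the whole 300 years. The variable names are illustrative.

```python
import math

MICROSECONDS_PER_YEAR = 365.25 * 24 * 3600 * 1e6   # about 3 x 10**13

# Strategies examined in 300 years at one per microsecond, as a power of ten.
log10_examined_today = math.log10(300 * MICROSECONDS_PER_YEAR)   # about 16

# Two orders of magnitude of speed gained every five years, for 300 years.
orders_gained = (300 // 5) * 2

# Orders of magnitude still separating the machine from 10**20000 strategies.
LOG10_STRATEGIES = 20_000
orders_remaining = LOG10_STRATEGIES - (log10_examined_today + orders_gained)

print(f"{MICROSECONDS_PER_YEAR:.2e}")
print(round(log10_examined_today, 2), orders_gained, round(orders_remaining, 2))
```

Even under this assumption fewer than 140 of the 20,000 orders of magnitude are accounted for.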

It might be asked at this point how it is that people can dare to play games of this 
sort. One possible answer is that so many, indeed practically all, of the strategies 
impartially listed above are obviously terrible and as a result, never get considered. 
Of course the optimum strategies probably escape consideration as well. When the 
game is played by real players it is impossible even to say which strategies are being 
used, much less estimate how good their strategies would be against a theoretical so- 
lution of the game. 

Turning from this exercise in futility we shall now touch lightly upon some of the 
difficulties which may be met in modeling a problem for analysis. Consider, for ex- 
ample, the difficulty of determining an objective function for a large, publicly owned 
firm. While the notion of decision-making to maximize profits seems acceptable, it 
may be difficult to make the notion sufficiently precise, except under steady-state
assumptions. Under typical circumstances there are choices which permit trading
profits in one period for profits in another, characterized by a number of degrees of 
freedom. Moreover there are options concerning the portion of profits to be distri- 
buted and the portion to be retained for capital growth. Taken together with the diverse 
aspirations of the stockholders for long-term gains or for income at various times, 
these circumstances suggest that management does not have a mathematically precise 
criterion for insertion in a model. The notion that management attempts to maximize 
profits may serve to rule out a large number of clearly bad decisions but may be, in 
that simple form, inadequate for comparison of courses of action which are all pretty 
interesting, and provide practical methods for situations in which the obvious more
general method would be hopeless.

For those who may recall some calculus, the evaluation of $\int_0^\infty e^{-x^2}\,dx$ provides an
interesting example. In this case, the direct methods of evaluation, suitable to a
large class of definite integrals, are cumbersome and tedious. Suppose the unknown
integral is called I. Then

$$I = \int_0^\infty e^{-x^2}\,dx = \int_0^\infty e^{-y^2}\,dy,$$

where the second integral simply replaces the dummy variable "x" by "y". But now

$$I^2 = \int_0^\infty \int_0^\infty e^{-(x^2 + y^2)}\,dx\,dy$$

and a transformation to polar coordinates yields

$$I^2 = \int_0^{\pi/2} \int_0^\infty r e^{-r^2}\,dr\,d\theta.$$

Of course, $r e^{-r^2}$ has an indefinite integral $-\frac{1}{2} e^{-r^2}$ (where $e^{-x^2}$ has no indefinite
integral in elementary terms), and the result $I^2 = \pi/4$ is obtained, from which
$I = \frac{1}{2}\sqrt{\pi}$, a surprising result.
It is far from clear how the trick was originally motivated; countless students have 
been impressed by it. Similarly mysterious ingenious methods have been used for 
generations in the solution of certain special types of differential equations. 

The familiar game of Nim (and some of its close relatives) in which players alter- 
nate taking matches from one of possibly many piles until the last player removes the 
last match, provides another example. The object of the game may be to take the last 
match, or it may be to force the opponent to do so. In either case, there exists a 
simple method by which one of the players (corresponding to the initial conditions of 
the game) can force a win. Based on this method, simple Nim-playing machines which
have been constructed exhibit faultless play. The solution rests upon representation 
of the number of matches in each pile as a binary number, on the basis of which it is 
possible to characterize the set of states which a player should seek to establish. Once 
approached in this way the problem is quite simple; in contrast, an attack from the 
point of view of game theory alone would encounter a monstrous number of combinato- 
rial variations. 
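
As an illustrative sketch (not taken from the text), the binary-representation rule can be written out for the version of Nim in which taking the last match wins. A position is safe to hand to the opponent exactly when the bitwise exclusive-or of the pile sizes, often called the nim-sum, is zero. The function names below are hypothetical, and the variant in which the last match loses would need a modified endgame rule that is not shown here.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Bitwise XOR of all pile sizes (column-wise parity of the binary digits)."""
    return reduce(xor, piles, 0)


def winning_move(piles):
    """Return (pile_index, new_size) that leaves a zero nim-sum, or None.

    Assumes normal play: the player who takes the last match wins.
    """
    total = nim_sum(piles)
    if total == 0:
        return None  # every available move hands the opponent a winning position
    for i, size in enumerate(piles):
        target = size ^ total
        if target < size:  # the pile can be reduced to the target size
            return i, target
    return None  # not reached when total != 0


piles = [3, 4, 5]
print(winning_move(piles))
```

The player to move seeks to establish the set of states with zero nim-sum; from such a state every move by the opponent produces a nonzero nim-sum again.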

A particularly amusing example is furnished by the following game, played with an 
ordinary rectangular table and a large supply of pennies. Starting with an empty table 
two players alternate placing one penny each time, anywhere on the table. Pennies 
may touch one another, but may not overlap, nor may any penny be moved, once 
placed. A penny may overlap the edge, providing it remains on the table. The object 
of the game is to be the last to be able to put a penny down. This game, which appears
difficult to analyze, actually has a beautiful solution, at least in principle. If a player 
has first turn and plays in the exact center, he may then mimic any play of his oppon- 
ent, choosing each time the symmetric position on the opposite side of the center. 
This player thus is able after each turn to present his opponent with a symmetric po- 
sition, with the property that any position open to his opponent will have a counterpart 
opposite it, ensuring that the mimicking player can play as long as his opponent can. 
Here again we observe that the solution has a special flavor, deriving from the radial 
symmetry of the table. This solution would obviously not pertain to a table of free 
form. 
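
The mimicking strategy can be checked on a small discretized model. The sketch below is only an illustration under stated assumptions: penny centers lie on an integer grid, a penny counts as on the table when its center is on the table, and the opponent's choices are imitated by random probing. None of these details come from the text, and all names are hypothetical.

```python
import math
import random


def mirror(point, center):
    """Reflect a point through the center of the table."""
    return (2 * center[0] - point[0], 2 * center[1] - point[1])


def is_legal(point, placed, radius, width, height):
    """Legal if the center is on the table and no placed penny overlaps.

    Touching pennies (distance exactly one diameter) are allowed.
    """
    x, y = point
    if not (0 <= x <= width and 0 <= y <= height):
        return False
    return all(math.dist(point, q) >= 2 * radius for q in placed)


def play(width, height, radius, seed=0, max_tries=2000):
    """First player opens in the center, then mirrors every reply.

    Returns the number of pennies on the table when the opponent's
    random probing fails to find a legal spot.
    """
    rng = random.Random(seed)
    center = (width // 2, height // 2)  # width and height are assumed even
    placed = [center]
    while True:
        for _ in range(max_tries):
            p = (rng.randint(0, width), rng.randint(0, height))
            if is_legal(p, placed, radius, width, height):
                break
        else:
            return len(placed)  # opponent gave up; the mirroring player moved last
        placed.append(p)
        reply = mirror(p, center)
        # The symmetric spot must still be free, whatever the opponent chose.
        assert is_legal(reply, placed, radius, width, height)
        placed.append(reply)


print(play(100, 60, 3))
```

The internal assertion is the point of the sketch: after the opening move the position is always symmetric, so the reflected spot is free whenever the opponent's spot was.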

The examples discussed above provide encouragement of two sorts. In the first
place, they indicate the value of human analysis and the possibility of human transfor- 
mation of a problem to tractability from a state of apparent impregnability. We can 
continue to expect solutions of the ingenious variety, provided we do not neglect to 
seek them. In the second place, some solutions of the "ingenious" type may actually 
be found by computational approaches. Note, in particular, the role that symmetry 
played in the last example. It is not inconceivable that problem-solving methods cur-
rently under development [3] would permit "invention" of important solutions of this
sort by high-speed machines. It is appropriate to point out here that people constantly 
solve, in some sense, problems that are insoluble by mathematical standards. Limi- 
tations of time, deficiencies in information, overwhelming combinatorial complexity, 
all these are commonplace in everyday life. Human decision-making is characterized 
imprecisely by the terms "experience," "intuition," and "imagination". In ways not
well understood the functioning human compromises on a wholesale scale, reducing to 
manageable size the number of alternatives he will entertain, perhaps choosing a de- 
cision first and then rationalizing it, perhaps seeking a new decision if the rationaliza- 
tion fails to come off. We might profitably ask for ways in which computers can aid 
people in improving their decision-making, rather than insist on replacement of the 
human process by a machine process. For example, the most cogent question from 
this point of view is: how can a chess player use a machine to better his play, as op-
posed to how can a machine substitute for him. One could hope that clever coupling of 
machine facilities to a skilled chess player might admit of interesting results not 
achievable by chess-playing machine programs operating on their own.

Under the general heading of "simulation and gaming" there have been a number of
experimental studies in a wide range of areas, including military situations, machine- 
shop scheduling problems, warehousing operations, traffic flows, and refinery opera- 
tions, to name only a few. Simulation techniques provide experimental tools which 
may give new insights into problems already well formulated, may aid in the develop- 
ment of a model in not-so-well-formulated problems, may provide leads to the con-
struction of reasonable criteria or satisfactory rules of thumb, or may permit consi- 
deration for human decision of a larger number of alternatives than would otherwise 
be possible. Simulation processes essentially provide an experimental laboratory 
which can bring some problems out to where human intellects can operate on them. The 
role of the computer in simulation is primarily that of referee and score keeper in po- 
sing situations and translating decisions to consequences. The wide diversity of simu- 
lation studies already in existence is sufficient to indicate the high and growing inci- 
dence of problems in which simulation is useful. 

Another hope for the future improvement of decision-making rests upon the further 
development of techniques for translating implicit assumptions, values, or prejudices 
of individuals (or groups of individuals) into explicit form for analysis of their conse- 
quences. As has been pointed out above there are often conceptual and other difficulties 
which block the completion of appropriate decision-making models. In some cases it 
has been possible to capitalize on the fact that people do indeed make decisions even 
when they shouldn't apparently be able to do so. An assumed underlying consistent 
structure may be brought out by getting answers for a set of simple hypothetical situa- 
tions. Once made explicit the underlying structure may be used to provide input para- 
meters in more complex situations. An example of this proposes solution of a difficult 
allocation problem [4] by using value estimates derived from responses of a "policy 
board" to specially designed miniature allocation problems. The argument is that 
whatever rationale the policy board might have used, it can be reflected in a corres- 
ponding rationale applied to the more complex problem. The approach is conceptually 
similar to the process that may be used to estimate a priori probabilities and utilities 
in statistical decision problems [5] of incomplete information. It is hoped that develop-
ments of this kind will increasingly contribute to the solution of difficult problems,
sometimes involving otherwise indefinable notions, such as the value of a human life,
for example. 

The most challenging area of research is at the same time the one with the greatest 
potential for future effect on decision-making. This is the area of "artificial intelli- 
gence", under which we group heuristic problem-solving methods, already alluded to,
learning programs, pattern-recognition programs, and self-organizing logical pro-
cesses. These developments, at present in early infancy, proceed in one way or an-
other by analogy with what is understood of certain psychological processes, setting 
up inductive processes of diagnosis and formal manipulation of problem statements, 
or depending on random alteration and reinforcement, or borrowing variously from 
notions of perception, abstraction, or the like. In this way it is hoped to develop for
certain tasks or classes of tasks machine methods or organizations which would not
normally be found as a result of ad hoc effort on the particular tasks.

For concreteness' sake a few examples will be cited. Heuristic programming has
been applied to the task of proving theorems in Boolean algebra and to the solution of 
problems in Euclidean geometry, among others, and has been proposed for numerous 
situations, including chess [6]. Random trial and reinforcement models of learning
have been applied to simple tests [7], and the Perceptron, a statistical separation
model corresponding to self-organizing neural nets [8], has been successful in learn-
ing to discriminate between visual patterns. Activities of this kind, generally recent 
in origin, are mushrooming at the present time; it can be expected that in a few years 
the overall effort expended in these directions will be impressively large. Obvious ap- 
plications for the future are automatic language translation, automatic preparation of 
abstracts of scientific articles, devices for reading printed characters, information- 
retrieval systems, etc. The actual potential domain of applicability is far greater than
is indicated by these examples. If researchers can successfully model a few powerful 
principles of evolutionary program development, corresponding to the interaction of 
environment, genetic structures, and natural selection, we may expect truly marve- 
lous consequences. 
MOTIVATION FOR AN APPROACH
TO THE SEQUENTIAL DESIGN OF EXPERIMENTS 
Herman Chernoff 

1. Introduction. 

Considerable scientific research may be characterized in the following way. A sci- 
entist is interested in a problem. He performs an experiment to obtain information. 
This information not only serves to illuminate the problem but is used to design a more 
informative experiment. As information is accumulated, he continues designing more 
and more effective experiments until he reaches the point where he feels that further 
experimentation seems no longer necessary. Then he announces his results. The pro- 
cedure outlined may reasonably be entitled sequential design of experiments. In spite 
of its apparent usefulness, there seems to be no formal theory of sequential design of 
experiments in the literature of statistics. 

In this paper I wish to report on the motivation for an approach to a formal theory of 
sequential design of experiments which is presented in [2].

For the sake of mathematical simplicity it is often effective to develop first a large 
sample theory and this is what I propose to discuss. It might seem peculiar to talk of 
a large sample theory in connection with sequential analysis which was originally deve- 
loped in order to make it possible to stop sampling after few observations if those ob- 
servations happened to be very informative. However the sample size in the standard 
theory of sequential analysis tends to be large if the cost of experimentation is small
compared to the costs of making the wrong decisions. Thus we shall use the terms 
asymptotic theory and large sample theory synonymously to denote the theory applying 
to the case where the cost of experimentation approaches zero. 

The classical theory of statistics usually deals with two kinds of inference. These 
are testing hypotheses and estimation. For reasons we shall not discuss here, it seems 
that for estimation problems a large sample sequential design theory will not be sub- 
stantially different from a large but fixed sample size theory. Thus we shall confine 
our future discussion to the problem of testing hypotheses.

Let us consider the following problem which may serve as a prototype of the problem 
of the sequential design of experiments applied to testing hypotheses. It is desired to
determine which of two methods of manufacturing traveling wave tubes is more reliable.
Suppose that $p_1$ is the probability that a traveling wave tube made by method 1 will
meet the desired specifications and $p_2$ is the probability that such a tube made by
method 2 is satisfactory. Then the unknown state of nature is represented by $(p_1, p_2)$.
It is desired to adopt method 1 if the hypothesis $H_1: p_1 > p_2$ is true and to adopt method
2 if the alternative hypothesis $H_2: p_1 < p_2$ is true. Two experiments are available.
These are $E_1$, test a tube made by method 1, and $E_2$, test a tube made by method 2.
After each experiment, the statistician must decide whether to continue experimenta- 
tion or to stop. If he continues, he must decide which experiment to perform. If he 
stops he must select one of the two methods for adoption. 
2. Sequential analysis for large samples.

We shall find it illuminating to study a well developed theory of sequential analysis
for the case of large samples. This is the theory of sequentially testing a simple hypo-
thesis versus a simple alternative where there is no choice of experiments. For this
problem, Wald developed the sequential likelihood-ratio test and later collaborated
with Wolfowitz to prove its optimality (see [5] and [6]).

To be more specific let us suppose that an experiment is repeated many times, yielding
independent observations $X_1, X_2, \ldots, X_n, \ldots$. Let $H_1$ specify that the observations
have density $f_1(x)$. Let $H_2$, the alternative hypothesis, specify density $f_2(x)$. A test is
said to be a sequential likelihood-ratio test if there are two numbers A and B such that
after the n-th observation the test dictates

accept $H_1$ if $\sum_{i=1}^{n} \log \frac{f_1(X_i)}{f_2(X_i)} \ge A$,

accept $H_2$ if $\sum_{i=1}^{n} \log \frac{f_1(X_i)}{f_2(X_i)} \le B$, and

continue experimentation if $B < \sum_{i=1}^{n} \log \frac{f_1(X_i)}{f_2(X_i)} < A$.
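
A minimal sketch of the three-way rule just stated, in Python. The Bernoulli densities, the thresholds A = 4.6 and B = -4.6, and all function names are illustrative assumptions, not values from the text.

```python
import math
import random


def sprt(sample, log_ratio, A, B, max_n=100_000):
    """Run a sequential likelihood-ratio test.

    sample():     draws one observation
    log_ratio(x): log[f1(x) / f2(x)] for a single observation
    Accept H1 once the running sum is >= A, accept H2 once it is <= B.
    Returns (decision, number_of_observations).
    """
    s = 0.0
    for n in range(1, max_n + 1):
        s += log_ratio(sample())
        if s >= A:
            return "H1", n
        if s <= B:
            return "H2", n
    return "undecided", max_n  # safety stop, not part of the test itself


# Hypothetical simple hypotheses about a success probability.
P1, P2 = 0.6, 0.4


def log_ratio(x):
    """Log likelihood ratio of one Bernoulli observation (x is True or False)."""
    return math.log(P1 / P2) if x else math.log((1 - P1) / (1 - P2))


A, B = 4.6, -4.6  # assumed thresholds
rng = random.Random(1)
print(sprt(lambda: rng.random() < P1, log_ratio, A, B))
```

With these thresholds an unbroken run of successes stops the test after 12 observations, since each success adds log 1.5 (about 0.405) to the sum.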



Note that experimentation is stopped if $H_1$ or $H_2$ is accepted. Also the name likelihood-
ratio test derives from the fact that $\prod_{i=1}^{n} f_1(X_i)$ is called the likelihood of $H_1$ given
the data $X_1, X_2, \ldots, X_n$. Hence $\sum_{i=1}^{n} \log[f_1(X_i)/f_2(X_i)]$ is the logarithm of the ratio
of the likelihood of $H_1$ to the likelihood of $H_2$ and is called the logarithm of the
likelihood ratio. Now let us assume that c is the cost per experiment. Let $r_1$ and $r_2$ be
the costs attached to the two errors, reject $H_1$ (i.e., accept $H_2$) when $H_1$ is true, and
accept $H_1$ when $H_2$ is true respectively. Then it is shown in [1], [3], and [6] that all
admissible procedures are sequential likelihood-ratio tests. However this fact does not
pick out a best choice of A and B. To do so let us assume a priori probabilities $w_1$ and
$w_2$ for the hypotheses $H_1$ and $H_2$ respectively ($w_1 + w_2 = 1$) and let $c \to 0$.

For an arbitrary procedure, the risk or expected cost under $H_1$ is given by

$$R_1 = r_1 \alpha + c N_1 \tag{1a}$$

where $\alpha$ is the probability of rejecting $H_1$ when it is true and $N_1$ is the expected sample
size when $H_1$ is true. Similarly the risk under $H_2$ is

$$R_2 = r_2 \beta + c N_2 \tag{1b}$$

where $\beta$ is the probability of accepting $H_1$ when it is false and $N_2$ is the expected sample
size under $H_2$.

If c is small, the optimum sequential likelihood-ratio test should tend to have large
sample sizes. This is achieved by making A very large and B highly negative. As
$c \to 0$, we expect A and -B to approach infinity.

Let us continue heuristically. We shall suppose that A and -B approach infinity at
comparable rates. Then applying Wald's approximations of [5], we have

$$\alpha \approx e^{B}, \qquad \beta \approx e^{-A}, \tag{2}$$

$$N_1 \approx A/I_1, \qquad N_2 \approx -B/I_2, \tag{3}$$

where

$$I_1 = \int \log[f_1(x)/f_2(x)]\, f_1(x)\,dx \tag{4}$$

and

$$I_2 = \int \log[f_2(x)/f_1(x)]\, f_2(x)\,dx \tag{5}$$

are the Kullback-Leibler information numbers (see [4]). Then

$$R_1 \approx r_1 e^{B} + cA/I_1 \tag{6}$$

and

$$R_2 \approx r_2 e^{-A} - cB/I_2. \tag{7}$$

Applying the a priori probabilities, the average risk is given by

$$\mathcal{R} = w_1 R_1 + w_2 R_2. \tag{8}$$

We shall approximate the optimal procedure, that is the optimal choice of A and B, by
minimizing the approximation of $\mathcal{R}$ obtained by substituting the approximations of
equations (6) and (7). Setting the partial derivatives with respect to A and B equal to
zero, we obtain

$$A = \log \frac{I_1 w_2 r_2}{w_1 c} \approx -\log c \tag{9}$$

and

$$B = \log \frac{w_2 c}{I_2 w_1 r_1} \approx \log c. \tag{10}$$

For this procedure we have

$$\alpha = O(c), \qquad \beta = O(c), \tag{11}$$

$$N_1 \approx \frac{-\log c}{I_1}, \qquad N_2 \approx \frac{-\log c}{I_2}, \tag{12}$$

and

$$R_1 \approx \frac{-c \log c}{I_1}, \qquad R_2 \approx \frac{-c \log c}{I_2}. \tag{13}$$

Before proceeding to apply these asymptotic results, several remarks deserve men- 
tion. 

1. The sequential likelihood-ratio test can be interpreted from an a posteriori pro-
bability point of view. Applying Bayes theorem, if $H_1$ and $H_2$ have a priori probabilities
$w_1$ and $w_2$ and the data $X_1, X_2, \ldots, X_n$ which yield the likelihoods $L_{1n}$ and $L_{2n}$ are ob-
served, then $H_1$ and $H_2$ have a posteriori probabilities $w_{1n}$ and $w_{2n}$ given by

$$w_{1n} = \frac{w_1 L_{1n}}{w_1 L_{1n} + w_2 L_{2n}} \tag{14}$$

and

$$w_{2n} = \frac{w_2 L_{2n}}{w_1 L_{1n} + w_2 L_{2n}} \tag{15}$$

and then

$$\frac{w_{1n}}{w_{2n}} = \frac{w_1}{w_2}\, \lambda_n \tag{16}$$

where

$$\lambda_n = \frac{L_{1n}}{L_{2n}} \tag{17}$$

is the likelihood ratio. Then a stopping rule which calls for stopping when
$|\log \lambda_n| \ge -\log c$ is equivalent to one which calls for stopping when one of the a poste-
riori probabilities is roughly of the order of magnitude of c.

2. The information number $I_1$ is the expectation under $H_1$ of $\log[f_1(X)/f_2(X)]$ which
is the logarithm of the likelihood ratio corresponding to one observation. Thus $I_1$ mea-
sures the rate at which $\sum_{i=1}^{n} \log[f_1(X_i)/f_2(X_i)]$ tends to increase under $H_1$. This ex-
plains why $I_1$ appears in the approximation of equation (3). Similarly $I_2$, the expecta-
tion under $H_2$ of $\log[f_2(X)/f_1(X)] = -\log[f_1(X)/f_2(X)]$, measures the rate at which
$\sum_{i=1}^{n} \log[f_1(X_i)/f_2(X_i)]$ tends to go toward $-\infty$ when $H_2$ is the true hypothesis. Since
the risks $R_1$ and $R_2$ for the optimal procedure are approximately inversely proportion-
al to $I_1$ and $I_2$ respectively, high values of $I_1$ and $I_2$ are desirable. Incidentally, $I_1$ and
$I_2$ are known to be non-negative (see [4]).

3. Since the probability of accepting the wrong hypothesis is roughly of the order of 
magnitude c and the expected sample size is of the order of magnitude of -log c for the 
optimal procedure, it follows that the main contribution to the risk is made up of the 
cost of sampling. 

4. It is important to note that the optimal procedure and its risks, i.e., A, B, $R_1$,
and $R_2$, are relatively insensitive to changes in the particular values of the arbitrarily
chosen a priori probabilities $w_1$ and $w_2$ and to the values of $r_1$ and $r_2$. This is indi-
cated by equations (9), (10), and (13).

5. The above argument which consists of approximating the A and B for which the
average risk $\mathcal{R}$ is minimal by the values of A and B which minimize the approximation
to $\mathcal{R}$ is not rigorous. However a rigorous and detailed proof of the above results can be
derived using Wald's bounds in [5].

3. The design problem for deciding between two simple hypotheses. 

Having discussed the simplest sequential testing problem from the large-sample 
point of view, let us proceed to introduce a design element into it. First suppose that 
there are available two equally costly experiments $E_1$ and $E_2$. Suppose that the experi-
menter wishes to test a simple hypothesis $H_1$ versus a simple alternative $H_2$ but that
he must restrict himself to the use of one of these experiments exclusively. If he se-
lects $E_i$ and proceeds thereafter in an optimal fashion, his risks under $H_1$ and $H_2$ will
be inversely proportional to $I_1(E_i)$ and $I_2(E_i)$. Hence if $I_1(E_1) > I_1(E_2)$ and
$I_2(E_1) > I_2(E_2)$ it would obviously pay for him to select the experiment $E_1$.

Suppose however that $I_1(E_1) > I_1(E_2)$ but $I_2(E_1) < I_2(E_2)$. In this case $E_1$ is better if
$H_1$ is true and $E_2$ is better if $H_2$ is true. If we knew that $H_1$ were the true hypothesis,
we could use $E_1$. At first glance the last sentence sounds foolish. If we knew that $H_1$
were the true hypothesis, we would not bother experimenting at all. At this point the
large sample or small cost aspect of the problem becomes important again. If c is
very small, it may pay to continue experimentation even though the experimenter were
almost certain that $H_1$ is true. Thus if we extend our problem to the one where we may
select an experiment after each observation, it would make sense to select $E_1$ if the
previous data favored $H_1$ strongly and to select $E_2$ if the previous data favored $H_2$
strongly. More generally, if several experiments were available, and the data favored
$H_1$ strongly, it would seem reasonable to select the next experiment E so as to maxi-
mize $I_1(E)$.

For the case where c is very small, and very large samples will be called for, it 
isn't terribly important what is done for the first few observations. Thus we may now 
propose a procedure for sequentially testing a simple hypothesis versus a simple al-
ternative where several experiments are available. This procedure is suggested by the 
results of section 2 and the above discussion. 

Let $f_1(x, E)$ be the probability density of the data under $H_1$ if experiment E is used.
Similarly let $f_2(x, E)$ be the density of the data under $H_2$. Let $E^{(i)}$ be the experiment
used for the i-th observation and let $X_i$ be the i-th observation. Then

$$L_{1n} = \prod_{i=1}^{n} f_1(X_i, E^{(i)}) \tag{18}$$

and

$$L_{2n} = \prod_{i=1}^{n} f_2(X_i, E^{(i)}) \tag{19}$$

are the likelihoods of $H_1$ and $H_2$ based on the first n observations. The procedure sug-
gested calls for the following after the n-th observation. Stop experimenting and accept
$H_1$ if

$$\log \lambda_n = \log \frac{L_{1n}}{L_{2n}} = \sum_{i=1}^{n} \log \frac{f_1(X_i, E^{(i)})}{f_2(X_i, E^{(i)})} \ge -\log c. \tag{20}$$

Stop experimenting and reject $H_1$ if

$$\log \lambda_n = \log \frac{L_{1n}}{L_{2n}} = \sum_{i=1}^{n} \log \frac{f_1(X_i, E^{(i)})}{f_2(X_i, E^{(i)})} \le \log c. \tag{21}$$

Otherwise continue experimenting and select $E^{(n+1)}$ as follows. If $L_{1n} > L_{2n}$, act as
though $H_1$ were true to the extent of choosing $E^{(n+1)}$ to be that experiment E which
maximizes $I_1(E)$. If $L_{2n} > L_{1n}$, select $E^{(n+1)}$ to maximize $I_2(E)$.

In this procedure we have made use of section 2 to suggest stopping when the likeli- 
hood ratio is roughly of the order of magnitude of c. We have applied the discussion of 
section 3 to suggest acting as though the more likely hypothesis were true in selecting 
the next experiment. 
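
The procedure of this section can be sketched in a few lines of Python. Everything concrete below is an assumption made for illustration: the two Bernoulli experiments and their success probabilities, the cost c, the tie-breaking rule when the likelihoods are equal, and all function names. The experiments are chosen so that E1 is the more informative one under H1 and E2 the more informative one under H2.

```python
import math
import random

# Hypothetical experiments: name -> (success probability under H1, under H2).
EXPERIMENTS = {"E1": (0.5, 0.99), "E2": (0.99, 0.5)}


def kl_bernoulli(p, q):
    """Kullback-Leibler information of Bernoulli(p) relative to Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def info(name, hyp):
    """I_1(E) if hyp == 1, I_2(E) if hyp == 2."""
    p1, p2 = EXPERIMENTS[name]
    return kl_bernoulli(p1, p2) if hyp == 1 else kl_bernoulli(p2, p1)


def choose_experiment(log_lambda):
    """Act as though the currently more likely hypothesis were true."""
    hyp = 1 if log_lambda >= 0 else 2  # ties broken toward H1 (arbitrary choice)
    return max(EXPERIMENTS, key=lambda name: info(name, hyp))


def run(true_hyp, c, rng):
    """Sample until |log lambda_n| >= -log c; return (decision, sample size)."""
    threshold = -math.log(c)
    log_lambda, n = 0.0, 0
    while abs(log_lambda) < threshold:
        name = choose_experiment(log_lambda)
        p1, p2 = EXPERIMENTS[name]
        success = rng.random() < (p1 if true_hyp == 1 else p2)
        log_lambda += math.log(p1 / p2) if success else math.log((1 - p1) / (1 - p2))
        n += 1
    return ("H1" if log_lambda > 0 else "H2"), n


print(run(true_hyp=1, c=1e-3, rng=random.Random(0)))
```

In this toy setting the choice of experiment switches whenever the log likelihood ratio changes sign, which is the behavior the text describes as acting as though the more likely hypothesis were true.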
4. Composite hypotheses. 

In the preceding section a procedure was suggested for the case of testing a simple 
hypothesis versus a simple alternative. The prototype example mentioned in section 1 
is more complicated. Here the hypotheses are composite. That is to say the hypothesis 
and the experiment do not uniquely determine the distribution of the data. Thus, in the 
prototype example $H_1$ can be true with $p_1 = .6$ and $p_2 = .4$ or with $p_1 = .45$ and $p_2 = .43$.
These two cases will lead to different distributions for the data. What is suggested by 
our previous discussions for this case? The use of the likelihood ratio for the stopping 
rule is easily generalized. However the problem of the choice of experiment seems to 
call for some additional concept. 

To study this problem, let us think of the simplest case involving composite hypo- 
theses. Suppose that the distribution of the data is determined by the experiment and a 
parameter $\theta$. In general we denote $H_1$ and $H_2$ by

$$H_1: \theta \in \omega_1 \quad \text{and} \quad H_2: \theta \in \omega_2 \;{}^{1} \tag{22}$$

where $\omega_1$ and $\omega_2$ are two sets with no points in common. The simplest case involving
a composite hypothesis is that where $\omega_1$ consists of one point $\theta_1$ and $\omega_2$ consists of
the two points $\theta_2$ and $\theta_3$. Then we are testing the simple hypothesis $\theta = \theta_1$ versus the
composite alternative $\theta = \theta_2$ or $\theta_3$. In our discussion of this case let us use the a priori
and a posteriori probability point of view. The a priori probabilities of $\theta_1$, $\theta_2$, and $\theta_3$
will be denoted here by $w_1$, $w_2$, and $w_3$. The a posteriori probabilities after n observa-
tions will be denoted by $w_{1n}$, $w_{2n}$, and $w_{3n}$. The first remark near the end of section 2
relates the a posteriori probabilities to the stopping rule. It seems natural to stop
when either $w_{1n}$ or $w_{2n} + w_{3n}$ is of the order of magnitude of c.

1 The symbol $\in$ denotes "is an element of."

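Suppose now that $H_1$ were the true hypothesis. Then it would be desired to have
$w_{2n} + w_{3n}$ become small rapidly. Let us study how fast $w_{2n}$ and $w_{3n}$ approach zero
when an experiment E, which yields data with density $f(x, \theta, E)$, is applied. As in sec-
tion 2, we have

$$\frac{w_{2n}}{w_{1n}} = \frac{w_2}{w_1} \prod_{i=1}^{n} \frac{f(X_i, \theta_2, E)}{f(X_i, \theta_1, E)} = \frac{w_2}{w_1}\, e^{-S_{2n}} \tag{23}$$

where

$$S_{2n} = \sum_{i=1}^{n} \log \frac{f(X_i, \theta_1, E)}{f(X_i, \theta_2, E)}. \tag{24}$$

When $\theta = \theta_1$, each term of the sum $S_{2n}$ has expectation given by

$$I(\theta_1, \theta_2, E) = \int \log \left[ \frac{f(x, \theta_1, E)}{f(x, \theta_2, E)} \right] f(x, \theta_1, E)\,dx. \tag{25}$$

Similarly

$$\frac{w_{3n}}{w_{1n}} = \frac{w_3}{w_1}\, e^{-S_{3n}} \tag{26}$$

where

$$S_{3n} = \sum_{i=1}^{n} \log \frac{f(X_i, \theta_1, E)}{f(X_i, \theta_3, E)}, \tag{27}$$

each term of which has expectation given by

$$I(\theta_1, \theta_3, E) = \int \log \left[ \frac{f(x, \theta_1, E)}{f(x, \theta_3, E)} \right] f(x, \theta_1, E)\,dx. \tag{28}$$

As $n \to \infty$, $S_{2n}$ and $S_{3n}$ approach infinity at rates determined by $I(\theta_1, \theta_2, E)$ and
$I(\theta_1, \theta_3, E)$. Then $w_{1n}$ approaches 1 and $w_{2n}$ and $w_{3n}$ approach zero at exponential rates
determined by $I(\theta_1, \theta_2, E)$ and $I(\theta_1, \theta_3, E)$.

Suppose that $I(\theta_1, \theta_2, E) > I(\theta_1, \theta_3, E)$. Then $w_{2n}$ approaches zero much faster than
$w_{3n}$. In fact the rate at which $w_{2n} + w_{3n}$ approaches zero is then determined by
$I(\theta_1, \theta_3, E)$, i.e., the smaller of $I(\theta_1, \theta_2, E)$ and $I(\theta_1, \theta_3, E)$. Thus the experiment which
would be most effective would be the one which maximizes

$$\min \{ I(\theta_1, \theta_2, E),\; I(\theta_1, \theta_3, E) \}.$$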
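This suggests a general procedure for selecting the experiment. First let

$$I(\theta, \varphi, E) = \int \log \left[ \frac{f(x, \theta, E)}{f(x, \varphi, E)} \right] f(x, \theta, E)\,dx. \tag{29}$$

Also we define the space alternative to $\theta$ by

$$a(\theta) = \omega_2 \text{ if } \theta \in \omega_1 \quad \text{and} \quad a(\theta) = \omega_1 \text{ if } \theta \in \omega_2. \tag{30}$$

In our problem, the space alternative to $\theta_1$ was $\omega_2 = \{\theta_2, \theta_3\}$ which corresponds to
the hypothesis alternative to $\theta_1$. Similarly the space alternative to $\theta_2$ is $\omega_1 = \{\theta_1\}$.

After n observations compute the maximum-likelihood estimate $\hat\theta_n$ of $\theta$. Act as
though $\hat\theta_n$ were the true value of $\theta$ by selecting the next experiment $E^{(n+1)}$ to make as
large as possible the smallest of the $I(\hat\theta_n, \varphi, E)$ for all $\varphi$ in $a(\hat\theta_n)$. In other words
select $E^{(n+1)}$ to maximize

$$\min_{\varphi \in a(\hat\theta_n)} I(\hat\theta_n, \varphi, E).$$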



The procedure of maximizing a minimum is reminiscent of the solution of a two-per-
son zero-sum game. It corresponds to the behavior of a player selecting E to maxi-
mize a "payoff" $I(\hat\theta_n, \varphi, E)$ when he has an opponent who will react to his choice with
the worst possible alternative $\varphi$.

The theory of games tells us that frequently a player can improve his position by 
using randomized strategies. What does a randomized strategy represent for an ex- 
perimenter? If one were to use a table of random numbers to select an experiment, 
this choice could be considered a randomized experiment. An example of a randomized 
experiment is E which consists of performing $E_1$ with probability .4, $E_2$ with probabi-
lity .5, and $E_3$ with probability .1.$^2$ It is a fact that the randomized experiment E which
selects $E_1, E_2, \ldots$ with probability $p_1, p_2, \ldots$ will yield

$$I(\theta, \varphi, E) = \sum_i p_i\, I(\theta, \varphi, E_i). \tag{31}$$

Then, as in game theory, broadening the class of available experiments to include the
randomized experiments occasionally has the effect of enabling the experimenter to do
somewhat better in maximizing

$$\min_{\varphi \in a(\hat\theta_n)} I(\hat\theta_n, \varphi, E).$$

2 To select the experiment with a table of random numbers, one may take $E_1$ if a ran-
dom digit chosen from the table is 0, 1, 2, or 3; take $E_2$ if the digit is 4, 5, 6, 7, or 8;
and take $E_3$ if the digit is 9.
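
A small numerical sketch may make the gain from randomization concrete. It relies on the fact that the information of a randomized experiment is the probability-weighted average of the informations of its component experiments. The three states, the two Bernoulli experiments, their success probabilities, and the function names are all hypothetical choices made for illustration: experiment Ea separates theta1 from theta2 only, and Eb separates theta1 from theta3 only.

```python
import math

# Hypothetical success probabilities of two Bernoulli experiments under three states.
P = {
    "Ea": {"theta1": 0.5, "theta2": 0.8, "theta3": 0.5},
    "Eb": {"theta1": 0.5, "theta2": 0.5, "theta3": 0.8},
}


def kl_bernoulli(p, q):
    """Kullback-Leibler information of Bernoulli(p) relative to Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def info(theta, phi, experiment):
    """I(theta, phi, E) for a nonrandomized experiment E."""
    return kl_bernoulli(P[experiment][theta], P[experiment][phi])


def worst_case(theta, alternatives, p):
    """Minimum over phi of the information when Ea is used with probability p, Eb otherwise."""
    return min(
        p * info(theta, phi, "Ea") + (1 - p) * info(theta, phi, "Eb")
        for phi in alternatives
    )


def best_mixture(theta, alternatives, steps=100):
    """Grid search over the mixing probability p (a crude stand-in for solving the game)."""
    candidates = [k / steps for k in range(steps + 1)]
    p = max(candidates, key=lambda q: worst_case(theta, alternatives, q))
    return p, worst_case(theta, alternatives, p)


alts = ["theta2", "theta3"]
pure = max(worst_case("theta1", alts, 1.0), worst_case("theta1", alts, 0.0))
p, mixed = best_mixture("theta1", alts)
print(pure, p, mixed)
```

In this constructed example each nonrandomized experiment has a worst-case information of zero, because each leaves one alternative indistinguishable from theta1, while the even mixture guarantees half of the information of the better experiment against either alternative.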

5. The general procedure and its properties.

In section 2 we indicated that the stopping rule should lead to stopping experimenta-
tion when the logarithm of the likelihood ratio is of the order of magnitude of -log c
or equivalently when the a posteriori probability is roughly of the order of magnitude
of c. In section 3, we indicated that the experimenter should act as though his estimate
of the parameter were the true value in selecting the next experiment to maximize the
information. In section 4, we extended this notion to indicate how to maximize the in-
formation for testing composite hypotheses. Here the experimenter acts as though he
were playing a game against an opponent selecting an alternative $\varphi$.

Let us summarize these ideas and present a general procedure whose properties we
shall discuss. Let $\hat\theta_n$ be the maximum-likelihood estimate of $\theta$. Let $\tilde\theta_n$ be the maxi-
mum-likelihood estimate of $\theta$ when $\theta$ is restricted to $a(\hat\theta_n)$, the space alternative to
$\hat\theta_n$. Then

$$S_n = \log \lambda_n = \sum_{i=1}^{n} \log \left[ f(X_i, \hat\theta_n, E^{(i)}) / f(X_i, \tilde\theta_n, E^{(i)}) \right] \tag{32}$$

is the logarithm of the generalized likelihood ratio. Our procedure tells us to stop and
accept the hypothesis corresponding to $\hat\theta_n$ if $S_n \ge -\log c$. Otherwise we must select an
(n+1)st experiment $E^{(n+1)}$. This is to be that randomized experiment which maximizes

$$\min_{\varphi \in a(\hat\theta_n)} I(\hat\theta_n, \varphi, E).$$

To discuss this procedure let

$$I(\theta) = \max_{E} \min_{\varphi \in a(\theta)} I(\theta, \varphi, E) \tag{33}$$

where the maximum is taken with respect to the set of randomized experiments. In
[2] it is shown that if there are a finite number of states of nature and a finite number
of (nonrandomized) experiments available then as $c \to 0$, the risk or expected cost
$R(\theta)$ for using this procedure satisfies

$$R(\theta) \approx \frac{-c \log c}{I(\theta)}. \tag{34}$$

Furthermore this procedure is optimal in the following sense. For a procedure to do
better for some $\theta$, i.e., to decrease $R(\theta)$ by some factor for some $\theta$, it must do worse
by an order of magnitude for some other $\theta$. In other words suppose that there is an-
other procedure which yields risks $R^*(\theta)$, and there is a $\theta_1$ such that



for small c. Then it cannot be the case that $R^*(\theta)/R(\theta)$ is bounded away from $\infty$ for
all $\theta$.

6. Miscellaneous remarks. 

1. The asymptotic optimality of the procedure of section 5 may not be especially re-
levant for the initial stages of experimentation, especially if the cost of sampling is not 
small. At first it is desirable to apply experiments which are informative for a broad 
range of parameter values. Maximizing the Kullback-Leibler information number may
give experiments which are efficient only when $\theta$ is close to the estimated value.

2. It is clear that the methods and results apply when the cost of sampling varies 
from experiment to experiment. Here we are interested in selecting experiments 
which maximize information per unit cost. 

3. Mr. Stuart Bessler has generalized the results of section 5 to cases which involve 
selecting one of k mutually exclusive hypotheses, and where there are infinitely many 
experiments available. 

4. The asymptotic study of the problem of testing a simple hypothesis versus a 
simple alternative suggests that it should be possible to refine the stopping rule for the 
composite problem. While the main term of the risk should not be affected the higher 
order terms could probably be improved. Such improvement may be quite important in 
the case where c is not very small. A refinement in the stopping rule would be relevant 
for problems of testing composite hypotheses even if the problems do not involve the 
choice of experiments. 

5. Mr. A. E. Albert has generalized some of the results of section 5 to the important 
case where there are infinitely many possible states of nature. 
SOME PROBLEMS CONCERNING THE CONSISTENCY OF

MATHEMATICAL MODELS 

J.L. Doob 

I shall describe this morning various criteria used in adopting a mathematical 
model of an observed stochastic process. The observed process assigns to certain 
instants of time t a corresponding number. The number may be the number of cars 
that have passed some intersection by time t, the number of telephone calls that have 
been initiated by time t, the insurance payable to accident victims by time t, etc. 
What kind of mathematical model should one make in such situations? 

For example, consider the number of cars that have passed a given point by time t. 
The first hypothesis is a typical mathematical hypothesis, suggested by the facts and 
serving to simplify the mathematics. The hypothesis is that the stochastic process of 
the model has independent increments. That is, if x(t) is the number of cars that have
passed by time t, and if $t_1 < \cdots < t_n$, then the random variables
$x(t_2) - x(t_1), \ldots, x(t_n) - x(t_{n-1})$ are mutually independent. Roughly, this
hypothesis is that future and past traffic are mutually independent.

The next hypothesis, that of stationary increments, states that, if s < t, the distri-
bution of x(t) - x(s) depends only on the time interval length t - s. This hypothesis
means that we cannot let time run through both slack and rush hours. Traffic intensity 
must be constant. 

The next hypothesis is that events occur one at a time. This hypothesis is at least 
natural to a mathematician. Because of limited precision in measurements it means 
nothing to an observer. (We may, if we wish, define a new kind of event, consisting of 
simultaneous occurrence of one or more events of the old kind. Then we will have only 
one of the new kind occurring at any time.)

The next hypothesis is of a more quantitative kind, which also is natural to anyone 
who has seen Taylor's theorem. It is that the probability that at least one car should 
pass in a time interval of length h should be ch + o(h). Here c is a positive constant
and o(h) means a quantity small compared with h when h is near zero. This hypothesis
is usually coupled with the hypothesis, related to one already made on simultaneity,
that the probability is o(h) that more than one car passes in a time interval of length h.
We can of course keep adding hypotheses, for example demanding that the number of
car passings in an interval of length h have expectation ch. At some stage we would
begin to wonder whether all these hypotheses are mutually consistent, and whether
some imply the others. In this case it turns out that all of the above conditions are
mutually compatible.

Note the different character of the various conditions. The overall one is that a
probability model is appropriate. Independence and stationarity of increments are
qualitative; the others are quantitative. After imposing such conditions on a
mathematical model, possibly unnecessarily many, one must prove that there is a model
actually satisfying these conditions. In this particular case there is, the Poisson
stochastic process, in which if s < t, x(t) - x(s) has a Poisson distribution with mean value
a multiple of t - s. A great deal of work has been done in this area. The Poisson
process arises frequently in this simple form and in more complex forms, in insurance
problems and telephone engineering.
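
The compatibility of the quantitative hypotheses can be checked directly from the Poisson law. The following is a short worked verification, not part of the lecture as delivered; it assumes that the "multiple of t - s" is the same constant c that appears in the hypotheses, so that the increment over an interval of length h has mean ch.

```latex
\begin{align*}
P\{x(t+h) - x(t) \ge 1\} &= 1 - e^{-ch} = ch + o(h), \\
P\{x(t+h) - x(t) \ge 2\} &= 1 - e^{-ch} - ch\, e^{-ch} = o(h), \\
E\{x(t+h) - x(t)\} &= ch .
\end{align*}
```

The first two lines follow from the expansion $e^{-ch} = 1 - ch + O(h^2)$; the third is the mean of the Poisson distribution.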

A more complex model is obtained by supposing that a system is under investigation
which can take on various states numbered 1, 2, .... Define $P_{ij}(t)$ as the probability
that if the system is in state i at time r, then it will be in state j at time r + t. The
form given to this transition probability assumes stationarity, in that there is no
dependence on r. Under the further assumption that the conditional probability just
described does not change if the states at times prior to r are given (Markov property),
the probability relations of the process are determined, up to an assignment of an
initial distribution of states. The question now becomes that of the determination of
the transition probability functions. The hypothesis that is usually made here is that
the probability of a transition from state i to state j in a small time interval, t, has
the form (again suggested by Taylor's theorem),

(1) $P_{ij}(t) = t q_{ij} + o(t), \quad i \ne j.$

Let us suppose that the probability of remaining in state i for small time t is near unity,
and thus we assume



(2) $P_{ii}(t) = 1 + t q_{ii} + o(t),$

where the $q_{ij}$'s can be evaluated from physical considerations. Obviously $q_{ij} \ge 0$ for
$i \ne j$, and $q_{ii} \le 0$, and the condition $\sum_j P_{ij}(t) = 1$ suggests the condition
$\sum_j q_{ij} = 0$.

The next question is: what kind of model is obtained in this case? If we are given the
$q_{ii}$'s and $q_{ij}$'s, these tell us the probability, neglecting terms of higher order, of
making a specified transition in a small time. Does this imply that given any set of
numbers, the $q_{ii}$'s and $q_{ij}$'s, we can obtain a unique transition probability matrix system
satisfying (1) and (2), or do we have to impose other hypotheses?

If there are only finitely many states it turns out that there is one and only one set
of transition probability functions corresponding to the $q_{ij}$ matrix. If we are dealing
with a system in which there are infinitely many states, then (as is known from a
good deal of work in the last decade or so) just assigning $q_{ii}$'s and $q_{ij}$'s with the
obvious relations between them is not enough. We must do more, since if the $q_{ii}$'s
become large as i varies we no longer have a simple case, and encounter many
mathematical difficulties. A sufficient condition for uniqueness is that the $q_{ii}$ sequence be
bounded.
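
In the finite case the unique transition functions can be written as the matrix exponential of t times the q matrix, a standard fact that the lecture does not spell out. The sketch below is only an illustration: the three-state matrix `Q`, the helper names, and the truncated power series are all assumptions made for the example.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=60):
    """Approximate P(t) = exp(t q) by the first `terms` terms of its power series."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds (t q)^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x * t / k for x in row] for row in mat_mul(term, q)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical q matrix: off-diagonal entries >= 0, each row sums to zero.
Q = [[-2.0, 1.0, 1.0],
     [0.5, -1.0, 0.5],
     [1.0, 3.0, -4.0]]

if __name__ == "__main__":
    for row in transition_matrix(Q, 0.5):
        print([round(x, 4) for x in row])
```

For small t the computed entries agree with (1) and (2) up to terms of order t squared, and every row of P(t) sums to one.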

Consider now an example of a different type, Brownian motion, encountered in many 
discussions of noise phenomena and molecular and atomic phenomena generally. It is 
observed that microscopic particles in a fluid undergo spontaneous irregular motion. 
It was surmised early, and reasonably verified much later, that this motion was due 
to the impact on the microscopic particle by groups of molecules hitting it on the same 
side. The particles themselves are so much bigger than the molecules of the fluid, 
that a molecule hitting a particle would have little effect. But if a "large" number of 
them happen to hit the particle on the same side, then it will move appreciably, while 
if they hit on the other side it will move in the opposite direction. Let us see what 
mathematical model we can construct to describe this physical situation. 

Let x(t) be the x coordinate of the moving particle at time t. (If we want to solve the 
problem in all three dimensions simultaneously we could think of x(t) as a vector). It 
is not unreasonable as a first approximation to suppose that the x(t) process has inde- 
pendent increments. In this situation, independent increments means that the displace- 
ments corresponding to disjoint time intervals are independent, because of the "fact" 
that the molecules hitting the particle in any one time interval bear no relation to those
hitting the particle in any other disjoint time interval. We observe that this hypothesis 
is plausible if the time intervals are not too close to each other. However the hypo- 
thesis of independent increments does not have this latter restriction. 

The next hypothesis, more reasonable in this case than in that of the traffic example, 
is that of stationary increments. 

Another natural hypothesis is that the particle trajectories are smooth. Use of the 
word smooth may bring in ambiguities and therefore it has been subject to many argu- 
ments. If we are thinking of these paths as trajectories of particles, then what is 
meant by smooth is obvious at any particular time. However, some functions which 
we are at present willing to think of as smooth functions would have been thought of one 
hundred years ago as functions with very jagged graphs. What we might think of, per- 
haps, is the minimum hypothesis of smoothness; i.e., that the trajectories are con- 
tinuous. In other words the particle does not suddenly jump from one position to an- 
other. 

Going further in the same direction, it might seem reasonable to suppose that the 
particle trajectories have first and second time derivatives, (that is, that the x(t) 
sample functions have these derivatives). In physical language this means that the par- 
ticles have well-defined velocities and accelerations. It has been shown, however, 
that the hypotheses of stationary independent increments and continuous trajectories 
suffice to determine the process up to two constant parameters. The displacement 
x(t) - x(s) is necessarily normally distributed with mean zero and variance proportion-
al to |t - s|. This means that it is both unnecessary and risky to make hypotheses
about x'(t) and x"(t). In fact, it turns out that in this mathematical model these deriva- 
tives do not exist! That is, the sample functions of the x(t) process (we are consider- 
ing the mathematical model only) are continuous but do not have derivatives. 

Here we have reached the case in which our mathematical model has outrun the na- 
tural hypotheses. When this was discovered, it was concluded that "anybody could see" 
that these particles had the most extraordinarily wild oscillations, and so it just looked
plausible that the x-coordinate functions of the particles were examples of continuous 
functions which do not have derivatives. It was then assumed that this was a "real-life" 
example of degenerate functions that are continuous, but have no derivatives. This 
was obviously a rationalization, of a type which may be seen frequently in science,
even though it does not look like a rationalization until the next stage, when it is shown 
up for the nonsense it is. 

In a later refinement of the mathematical model, due to Ornstein and Uhlenbeck, 
the hypothesis of independence of increments was dropped, and in this model the 
sample functions have first (but not second) derivatives. Perhaps scientists had 
learned the lesson not to take the mathematical model too seriously, for at least now 
nobody said that you could see by looking at the particles that they had velocities, but 
did not have accelerations. 

The progress from the first to the second model illustrates the fact that no model 
can hope to reflect all of reality, and extremely delicate properties of the model can- 
not be taken too literally as direct reflections of reality. The next example also illus- 
trates this. 

Consider a simple pendulum. If $\theta$ is the angle that the string makes with the
vertical, it turns out, according to the general principles of statistical mechanics, that
$\theta(t)$, the value of the angle at time t, is a random variable which is not identically
constant even in the absence of external forces. The mean value is 0 and the variance can be
computed using the equipartition principle. In the standard mathematical model, $\theta(t)$
has normal distribution, and this means that $\theta(t)$ can take on any value from $-\infty$ to
$+\infty$. This can be interpreted to mean that if we wait long enough the pendulum will
not only move perceptibly to the naked eye but even go around across the top! This is
the kind of delicate result inherent in the model that need not be taken seriously. This
is an example that shows that when one constructs a model one may get something
more than was bargained for and thereby a limitation on the meaningfulness of the
results. One must always distinguish between mathematical and empirical concepts.
For example, in the Brownian motion, one cannot talk about the existence of deriva- 
tives of the sample functions of the mathematical model. But physical experience and 
experimentation do not produce mathematical functions, and it is therefore not proper 
to apply such mathematical words as "continuity" and "derivability" to empirical
sample functions.

It is of interest to give some conditions insuring continuity of sample functions, 
since they are not very well known here, and have appeared in other languages (Russ- 
ian in particular). From these conditions it will often be possible to draw the proper 
conclusions concerning certain mathematical models. 

The first condition is due to Kolmogorov, who gave it in 1937. What interested him 
was under what conditions continuous paths were ensured. His condition was simply 
that there exist $\epsilon > 0$, $\alpha > 0$ such that for $h > 0$ sufficiently small

(3) $E\{|x(t + h) - x(t)|^{\alpha}\} \le \text{const}\; h^{1+\epsilon}$

for all t.

Kolmogorov showed that under these conditions if we have a stochastic process with a 
given distribution, then the sample functions are continuous, (or at least we can get a 
model having the same distribution for which the sample functions are continuous). 

For example in the Brownian Motion process x(t + h) - x(t) is a normally distributed 
random variable with mean 0 and variance proportional to h. It can be verified that
here (3) is satisfied, and thus the Brownian Motion process has continuous sample
functions in the above sense.
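
The verification can be made explicit with the fourth moment of a normal variable. In the sketch below the symbol $\sigma^2$ for the constant of proportionality in the variance is an assumption introduced for the example; the lecture does not name it.

```latex
\begin{align*}
x(t+h) - x(t) &\sim N(0, \sigma^2 h), \\
E\{|x(t+h) - x(t)|^{4}\} &= 3\sigma^{4} h^{2} .
\end{align*}
```

The right-hand side is a constant times $h^2$, a power of h strictly greater than the first, which is what condition (3) requires when the exponent on the increment is taken to be 4. The second moment alone would not do, since it is only of order h.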

Considerably later Čentsov (1956) proved the following result concerning oscillatory
discontinuities of sample functions. (Discontinuities can be divided into two types; the
type in which there is a limit on each side, called a jump discontinuity, and the other
type, the oscillatory discontinuity.) There will be no oscillatory discontinuity if there
exist $\alpha$, $\beta$, $\epsilon$, all exceeding zero, such that for $h_1, h_2 > 0$ sufficiently small

(4) $E\{|x(t) - x(t + h_1)|^{\alpha}\, |x(t) - x(t - h_2)|^{\beta}\} \le \text{const}\; (h_1 + h_2)^{1+\epsilon}$ for all t.

Finally Dobrushin (1958) obtained conditions under which no jumps would be present.
His condition is that there should be $\epsilon > 0$ such that

(5) $\sup_t P\{|x(t + h) - x(t)| > \epsilon\} = o(h)$

for all h sufficiently small.

It should be noted that there are processes which have jump discontinuities, but no
oscillatory ones. An example is the Poisson process described at the beginning of the
lecture.

Suppose we have a process x(t) with the property that

$E[x(t) \mid x(\tau), \tau \le \tau_0 < t] \ge x(\tau_0).$

For any such processes, which are sometimes called semi-martingales, we know a 
priori that there are no oscillatory discontinuities. This means that a qualitative hypo- 
thesis of this sort will yield at once certain properties of the sample functions. Such 
broad qualitative assumptions furnish specific properties of the sample functions aris- 
ing from the model, these being rather delicate properties. We must be careful, as 
the Brownian Motion example demonstrated, to avoid contradictory assumptions in the 
mathematical model. This is at least one justification of the mathematician's standard 
goal of using the fewest possible assumptions. The fewer assumptions the smaller the 
likelihood of contradictions. 

SEQUENTIAL DECISIONING

Merrill M. Flood

Introduction. 

Making decisions is hard work. Managers who are faced with too many difficult de- 
cisions too fast often suffer physically and mentally. Yet, men who are believed to be 
capable of making good decisions an unusually high proportion of the time are apt to be 
well rewarded as executives or leaders. We extoll the traits of coolness and objectivity 
under stress, and of crisp but considered judgments when time is fleeting. It seems no 
wonder, in this age of rapid communication and transportation, that managers and 
scientists are striving to find mathematically rational systems to take over at least 
some of the load of stressful decision-making in a divided world. Perhaps better
decisioning systems of this kind will even improve our mental health as well as our
productive efficiency.

No pretense is made here that the mathematical scientists have solved the decision-
making problem -- indeed they have barely started on its formulation. However, there
is a rapidly growing and impressively solid scientific literature that deals
mathematically with various aspects of the familiar problem of choosing well between
alternatives [1]. All of this work is lumped here under the ungrammatical name "decisioning
science." Our concern will be with a few such concepts and techniques that
mathematically minded workers have introduced in recent years.

We are indebted to John von Neumann (1928) [2], for an especially clear insight into
the fundamental nature of the problem of making a wise choice among alternatives when
there is total uncertainty concerning the likelihood of the possible outcomes. These
basic ideas were later expanded by von Neumann (1944), in collaboration with economist
Oskar Morgenstern, to provide a mathematical formulation of the broad problem of
social and economic behavior. Their Theory of Games and Economic Behavior [3] is now
a classic in the decisioning field, and has provided the main stimulus for the intensive
development of mathematical decisioning science since World War II.

This is a revision of a paper presented at Arden House on June 13, 1956, at the
Seventh Annual Industrial Research Conference sponsored by the Department of
Industrial and Management Engineering of Columbia University.

We are concerned here with concepts, rather than with mathematical details, and 
must venture the impossible -- a description of the central concepts of game theory
[4], and of other mathematical decisioning theories, briefly and in reasonably simple
nonmathematical language. As is typical of the mathematical sciences, any hope of
understanding a new concept is very apt to require some understanding of a few older 
concepts. Here we must assume an adequate understanding of certain older concepts 
and mathematical systems, such as the calculus of probability, in order to get on with 
the newer ones -- and do so unblushingly, at the risk of being uncommunicative or 
even misunderstood. Ample references to fuller treatments are included for those 
who may wish to explore any of our topics further. 

Game Theory and Statistical Decision Theory.

We can illustrate the central idea of game theory by first posing a simple type of 
management question, and then showing how von Neumann's minimax principle might 
help to resolve the matter. 

Question. How difficult should the first $64,000 question have been made?

Situation. The rules are changed slightly from those of the famous $64,000
question television program. The guest is permitted to try to answer the
question even if he chooses not to go on for $64,000, and without penalty if
he misses, for a bonus prize (cost $4,000 to the sponsor) if he gives the
correct answer. He receives only this same prize if he goes on and fails.

Data. The sponsor estimates that it will be worth essentially the same to
him in advertising value whether or not the guest answers correctly, unless
he does go on to try for $64,000. If he does go on, then the sponsor
estimates that a winning answer will be equivalent to an increase of $64,000,
and a losing answer equivalent to a decrease of $64,000, from the overall
advertising value estimated otherwise for the program. The sponsor also
has decided that he is totally unable to estimate the odds against the first
guest choosing to go on to try the $64,000 question, and that the future of
the program will in no way be affected by the difficulty of this first question.

Analysis. The net incremental costs to the sponsor, over and beyond those
derived from the program independently of this particular decision, may be
represented in a four-way matrix as follows.



                            Outcomes

              Answers Correctly    Answers Incorrectly
Goes on          64 minus 64           4 plus 64
Stops            32 plus 4             32

              Costing Matrix (thousands of dollars)

Minimax Answer. Game theory shows how the sponsor can limit himself
to an expected cost amounting to $34,000 simply by tossing a fair coin to
choose between a very easy and a very hard question or, equivalently, by
choosing a question of medium difficulty that he feels the guest is just about
as apt to miss as not. Indeed, if the sponsor uses a question of medium
difficulty it no longer matters to him whether or not the guest chooses to
go on. This is an application of the minimax principle of game theory.
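
The $34,000 figure can be checked by finding the chance of a correct answer at which the sponsor's expected cost is the same whether the guest goes on or stops. The following is a minimal sketch with costs in thousands of dollars; the names `COSTS`, `expected_cost`, and `equalizing_probability` are illustrative assumptions, and the equalizing rule is the usual one for a two-by-two game with no saddle point.

```python
from fractions import Fraction

# Sponsor's costs in thousands of dollars.
# Key = guest's choice; tuple = (answers correctly, answers incorrectly).
COSTS = {
    "goes on": (64 - 64, 4 + 64),
    "stops": (32 + 4, 32),
}


def expected_cost(choice, p_correct):
    """Sponsor's expected cost when the answer is correct with chance p_correct."""
    correct, incorrect = COSTS[choice]
    return p_correct * correct + (1 - p_correct) * incorrect


def equalizing_probability():
    """Chance of a correct answer at which both guest choices cost the same."""
    a, b = COSTS["goes on"]
    c, d = COSTS["stops"]
    # Solve p*a + (1 - p)*b = p*c + (1 - p)*d for p.
    return Fraction(b - d, (b - d) + (c - a))


p = equalizing_probability()
value = expected_cost("goes on", p)

if __name__ == "__main__":
    print("chance of a correct answer:", p)
    print("expected cost (thousands):", value)
```

At this chance the guest's decision to go on or stop no longer changes the sponsor's expected cost, which is the indifference described in the text.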

A more general situation would be one in which there were more than two possible
outcomes, but not necessarily the same number of possible choices as there are
possible outcomes. The costing matrix would now show a separate value for each
choice-outcome possibility. The minimax theorem of von Neumann assures us that in the
general case there always exists at least one set of percentages, with one for each choice
possibility, such that a choice made on the basis of these percentages ensures that
the expected cost will not exceed a certain definite maximum value. Since, on the
other hand, the minimax theory also assures the existence of a set of percentages
corresponding to the outcome classes that ensures an expected cost at least as great as
this definite maximum value, and an even greater possible cost if the minimax choice
percentages are not used, it seems reasonable to follow the minimax principle. J.D.
Williams gives a useful and amusing set of examples of such "games," with real life
interpretations, in his very readable The Compleat Strategyst [4].

Just in order to show that the von Neumann percentages (Williams calls them "odd- 
ments") may not be too easy to calculate for a larger game, but nevertheless do exist, 
consider Example 20 from The Compleat Strategyst. This example is concerned with 
the problem met by a physician in prescribing one of three medicines for a patient in 
the face of uncertainty about which of five strains of bacteria is causing his illness. 
The costing matrix entries are the chances that the patient will be relieved by a parti- 
cular medicine-strain combination, the medicine to be chosen by the physician and the 
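strain by "Nature." The costing matrix and minimax percentages are as follows:

[Costing matrix: medicines 1 to 3 (rows) against strains 1 to 5 (columns), bordered by
the Physician's Percentages and Nature's Percentages. The individual entries are not
recoverable here; the legible percentages are 0.67 and 0.33.]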


Under the minimax principle, in this case, the physician would toss a six-sided die
and prescribe medicine 1 if 1, 2, 3, or 4 spots appeared, but medicine 3 otherwise. This
would ensure odds of 1 to 2 that the patient would be successfully treated, and the
physician could not guarantee better odds than these for the patient by any other manner
of choosing unless he could somehow obtain further information about the situation.

Sequential Decisioning.

It is apparent in the $64,000 question example that several quite unsatisfactory
assumptions were made -- so unsatisfactory, indeed, that it would be perfectly proper
to question the usefulness of the entire approach in this case. The example was chosen 
to serve as a kind of straw man to be attacked in order to bring out some of the con- 
ceptual flaws in the game-theoretic model. 

First, it would be very hard to make accurate enough estimates of the 
true overall worth to the sponsor of the advertising effects under each 
choice-outcome pair. Unfortunately, this is usually a barrier to the 
successful use of game theory in practical situations. 

Second, it was a very unrealistic assumption that the reaction to the first 
program would not affect subsequent results. A common device used in 
an effort to avoid this difficulty is to solve the problem from the outset 
for a whole sequence of such programs, but this attempt usually fails 
because of the extreme computational difficulties encountered as the 
model is extended in this way. 

Third, it is a little too pessimistic to accept the assumption that
Nature will do her expert best to thwart the person making the choice.
As Einstein said, Nature seems deep but not malicious.

Fourth, it is a bit too severe to restrict the available choices definitely
and finally to any particular set; there is always at least one other choice,
even if it is nothing more than to seek other alternatives before deciding.

Fifth, there are many other objections that can be raised in connection
with the von Neumann game-theoretical model, but they will not be listed
here. Suffice it to say that the model is exceedingly suggestive for further
work but rarely adequate in a practical situation.

Another kind of formulation for the decisioning problem is typified by the "two-armed
bandit problem." One of the many special forms of this problem is as follows:

A gambler has paid one dollar for the privilege of operating a "two- 
armed bandit" ten times, and with the right to pull either of the two arms 
on each successive play. He knows that the machine is so constructed 
that he will either get no return on any particular play, or will get two 
dimes back, and that the odds for such success remain constant for each 
arm during the entire course of ten plays. Finally, he knows that the 
actual odds for each arm were established independently and in such a 
way that each possible set of odds is equally likely for an arm. The 
gambler's problem is to make each successive play in such a way as to 
maximize his expected total return. 

So far as I know, this problem has not yet been solved. Or if it has been solved for 
two arms and ten plays, it certainly has not been solved for many arms and plays. 
Yet, this simple appearing problem has many of the key features of problems that are 
met in common decisioning situations: 

A sequence of choices is made between alternatives of a fixed set, 
each choice leading to an observed success or failure. The underlying
mechanism seems unchanging over a long enough period of time so that 
recent past experiences with it should be a safe guide to future results. 
How can successive choices best be made? 
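
For a horizon as short as ten plays, the gambler's problem as stated can be worked numerically by backward induction on the record of successes and failures for each arm. The sketch below is an illustration under one stated assumption: "each possible set of odds is equally likely" is read as a uniform prior on each arm's chance of success, so that the chance of success on the next pull of an arm with s successes and f failures is (s + 1)/(s + f + 2). The function and constant names are hypothetical.

```python
from functools import lru_cache

PAYOFF_CENTS = 20  # two dimes per successful play, as in the problem statement


@lru_cache(maxsize=None)
def best_expected_successes(s1, f1, s2, f2, plays_left):
    """Largest expected number of further successes, given the successes (s)
    and failures (f) seen so far on each arm and the number of plays left."""
    if plays_left == 0:
        return 0.0
    # Chance of success on the next pull of each arm under a uniform prior
    # (an assumption about how the odds were set).
    p1 = (s1 + 1) / (s1 + f1 + 2)
    p2 = (s2 + 1) / (s2 + f2 + 2)
    n = plays_left - 1
    pull_1 = (p1 * (1 + best_expected_successes(s1 + 1, f1, s2, f2, n))
              + (1 - p1) * best_expected_successes(s1, f1 + 1, s2, f2, n))
    pull_2 = (p2 * (1 + best_expected_successes(s1, f1, s2 + 1, f2, n))
              + (1 - p2) * best_expected_successes(s1, f1, s2, f2 + 1, n))
    return max(pull_1, pull_2)


if __name__ == "__main__":
    cents = PAYOFF_CENTS * best_expected_successes(0, 0, 0, 0, 10)
    print(f"expected return over ten plays: {cents:.1f} cents")
```

The number of states grows quickly with the number of arms and plays, which is consistent with the remark that the general problem is far harder than the ten-play case.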

Herbert Robbins [5] has discussed several variants of the two-armed bandit
problem, as a topic in the foundations of science, or of statistical inference, and has
remarked that "... the problem represents in a simplified way the general question of
how we learn -- or should learn -- from past experience ...." More recently, Bradt
and Karlin [15] have treated certain special cases of the problem.

Several useful techniques for sequential decisioning are now available, although none 
of them is fully adequate for any substantial class of problems commonly met in indus- 
trial management and control situations. Among these are: 

1) The Box-Wilson procedure [6], already successfully used in the field of
chemical experimentation.

2) Stochastic approximation procedures, such as the Robbins-Monro [7] and
Kiefer-Wolfowitz [8] processes, and generalizations and extensions due to
A. Dvoretzky [9], J.R. Blum [10], and others [11].

3) Stochastic game-learning models of the present author [12], and their
psychological counterparts as discussed by Bush-Mosteller [13] and others.

4) The dynamic programming techniques of Richard Bellman [14] and
related mathematical programming models.

All of these more recent sequential decisioning models attempt to avoid most of the
five basic objections to the von Neumann game-theoretic model that we listed earlier
in this paper. Unfortunately, each such attempt has flaws comparable in seriousness
to those for game theory. Examples will perhaps best serve to illustrate the kinds of
applications considered for some of these recently developed sequential decisioning
techniques, and also to show some of their good and bad features. Only the
Box-Wilson and Kiefer-Wolfowitz procedures will be discussed here.

Some Sequential Games.

There are a number of interesting and important results pertaining to sequential 
games. A few of these will be listed here before going on to the sequential decisioning 
examples. These results indicate the kind of mathematical treatment that is some- 
times possible, and for problems that at first formulation seem quite intractable. On 
the other hand, it is usually surprisingly difficult to extend such results beyond rela- 
tively simple special cases. 

The theory of games of timing [16] and of games of partitioning [17] has been
developed quite extensively during the past ten years. One of the simplest of these games
is the "single-shot noisy duel."

In this duel problem it is supposed that two combatants approach each 
other, starting out of range of each other, with each having the right to 
fire his single-shot weapon at any time during the approach. It is also 
supposed that each knows the other's chances of a hit at each range, that 
a shot can be heard by both, and that if either combatant fires and misses 
then the other will certainly approach close enough to be certain of hitting 
him. The problem is to give a rule for choosing the range at which to fire. 
The answer is essentially that each tries to fire at the first range when 
the sum of their individual probabilities of hitting is unity. 
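
The stated rule lends itself to a simple numerical search. The sketch below assumes that each combatant's chance of a hit is continuous and does not increase with range, and it uses two hypothetical accuracy functions; neither the functions nor the name `first_firing_range` come from the text.

```python
def first_firing_range(hit_a, hit_b, start_range, tol=1e-9):
    """Return the largest range in [0, start_range] at which
    hit_a(r) + hit_b(r) reaches 1, found by bisection.

    Assumes both accuracy functions are continuous and non-increasing in
    range, with a sum below 1 at start_range and at least 1 at range 0.
    """
    near, far = 0.0, start_range
    while far - near > tol:
        mid = (near + far) / 2
        if hit_a(mid) + hit_b(mid) >= 1:
            near = mid  # the sum is already at least 1, so look farther out
        else:
            far = mid   # the sum is still below 1, so look closer in
    return (near + far) / 2


# Hypothetical accuracy functions: linear fall-off with range.
def accuracy_a(r):
    return max(0.0, 1 - r / 100)


def accuracy_b(r):
    return max(0.0, 1 - r / 50)


if __name__ == "__main__":
    r = first_firing_range(accuracy_a, accuracy_b, start_range=100.0)
    print(f"fire at range {r:.2f}")
```

With these two functions the sum of the hit probabilities equals one at a range of 100/3, so that is where each combatant would try to fire under the rule quoted above.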

Various generalizations to several-shot cases, including cases where some of the shots
may be silenced, have been posed and solved. The results have found important
application in armament design for aircraft and other military vehicles. At least one
application has been made in industry, with regard to the optimal timing of release for
the catalogue of a mail order company. My first contact with this class of problems
was in 1949, when I proposed the single-shot noisy duel model as a possible one to use
in estimating the likely time for an aggressor to start a war; in this case, the
probability of winning the war would be taken as a function of progress in the armament race.
Here, as with other variants of the von Neumann 2-person constant-sum game model,
the difficulties are in choosing realistic values for the payoff functions, while the
mathematical determination of the optimal strategies is only computationally difficult.

Another sequential game that has a neat mathematical solution, and one that is
somewhat surprising, is one that I shall call the "fiancé problem." This may be played as
a 2-person zero-sum parlor game according to the following rules:

Players A and B start with a perfectly shuffled deck of N cards. The
cards are numbered from 1 through N on their faces. Player B is
required to state, on his first move, whether or not the top card is card N.
He wins the game whenever he says correctly that the top card is card
N, and he loses whenever he says incorrectly that the top card is card
N. If he says correctly that the top card is not card N, then Player A
removes that card and examines the new top card, telling Player B whether
or not it is larger than all cards previously removed, and Player B
is again required to state whether or not the new top card is card N;
of course, he says it is not card N if Player A informs him that it is not
larger than all previous top cards. The problem is to give a rule for
Player B to use in making his decisions that will maximize his chance for
winning the game. The (approximate) answer is that he says each of the
first N/e top cards is not card N, and that the next top card not announced
as low is card N; his expectation of winning the game is (approximately)
1/e if he uses this decision rule. (The exact formulas for the decision rule,
and for the chance of winning, are quite easily derived.)
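
The exact chance of winning can be tabulated for any deck size. The sketch below relies on a standard formula that is not derived in the text: if Player B passes over the first r cards and then names the first card announced as larger than all before it, he wins with probability (r/N) times the sum of 1/(k - 1) for k from r + 1 to N, and with probability 1/N when r is zero. The function names are illustrative.

```python
from fractions import Fraction


def win_probability(n, skip):
    """Exact chance of naming card n after passing over the first `skip`
    cards and then choosing the first card larger than all before it."""
    if skip == 0:
        return Fraction(1, n)  # the top card is named at once
    return Fraction(skip, n) * sum(Fraction(1, k - 1)
                                   for k in range(skip + 1, n + 1))


def best_rule(n):
    """Return (skip, probability) for the best number of cards to pass over."""
    best_skip = max(range(n), key=lambda r: win_probability(n, r))
    return best_skip, win_probability(n, best_skip)


if __name__ == "__main__":
    skip, prob = best_rule(100)
    print(f"N = 100: pass over {skip} cards, win chance {float(prob):.4f}")
```

For a deck of 100 cards the best rule passes over 37 cards, close to 100/e, and the chance of winning is close to 1/e, in agreement with the approximate answer given above.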

A sequential decisioning problem that I posed in 1950, as the "fiancé problem," is the
following:

A young girl wishes to marry the finest young man she can find. She
has met several young men, and some of them have asked for her hand
in marriage. How can she best decide whether to marry one of the men
she already knows or to continue her search for a still better husband?

This problem was posed in order to show the importance of the restriction, in the 
game theory model, to a known set of alternatives whereas the critical decision is 
instead often that relating to alternatives unknown and even unsuspected. Or, in general,
the usual practical problem is to decide whether to act on the basis of information
presently available about explicitly recognized alternatives or to search and reflect in 
the hope of discovering substantially better alternatives. 

If we grant that the mathematical fiance problem is not a perfectly valid representation
of the young girl's problem, and such mathematical models are never completely
faithful representations, we can nevertheless get some help from the mathematical 
solution as to the type of rule the young girl might best use. And in other problems, 
where the mathematical model is more valid, the solution may give important help; 
for example, in selecting the best item in a sample when destructive measurements 
must be made to determine the precise quality of each item even though an item may 
be compared to all its predecessors, the mathematical solution to the fiance problem 
would give a precise answer. Here again, there are real difficulties in making the 
model sufficiently realistic while the mathematical solution is only computationally 
troublesome. Other examples of applications are to be found where the passing of 
time, or the forgetting of information, plays the role of passing over top cards, yet 
with the capability of deciding whether the current case is or is not better than all
those previously experienced, and in situations where the total number of possible cases is
known precisely.

Another kind of sequential game, and the one of greatest interest here, is typified 
by the "explorer's problem":

An explorer is searching for the point of greatest thickness of ice
in a given polar region. He measures thickness by test borings, each
of which costs him the same amount. Each inch of added depth discovered
has a known value to him. His problem is to make a sequence
of test borings, stopping when he believes that further exploration 
would not be apt to yield him further gain. 

There is no ready answer to this problem, although several recent results in sequen- 
tial decision theory are useful in planning an exploration of this kind. More generally, 
this type of problem assumes that there is some real but unknown pattern to the surface
being explored and the need is for some systematic way of searching so as to take
some advantage of this lack of randomness in depths over the region being explored.
Furthermore, each measurement is subject to error so that several observations at 
any one point may differ appreciably from one another. An essential feature is that 
successive observations bear a cost that must be deducted from the over-all value of 
the completed task. Although there is some similarity between the explorer's problem 
and the fiance problem, they differ critically in that patterning was assumed not to 
exist in the fiance problem. 

The explorer's problem is the common one met by a design engineer. For example, 
if the problem is to design a chemical refinery then values of various parameters, 
such as sizes, pressures, and temperatures, must be chosen by the designer over the 
region allowed so as to optimize expected yield of the refinery. A calculation of ex- 
pected yield for any one set of design values bears a cost, and the designer must do a 
sequence of such computations (or trials) in his attempt to get a good design without 
too great design costs. Here the design parameters, perhaps in many dimensions, are 
analogous to the two geographical coordinates in the explorer's problem. Some examples
of this kind of sequential search problem will now be considered in more detail.

The Alcohol Plant - An Example.

D. S. McArthur [18] has constructed a simple analog computer to serve as an ab- 
stract representation of some of the more interesting decision-making problems en- 
countered by the manager of a refinery. He has called this computer the "Alcohol 
Plant, " and he has conducted some very interesting experiments with it in his investi- 
gation of the decision-making procedures that are used or that might well be used by 
plant managers. 

The Alcohol Plant has five process parameters, represented by settings on five 
dials, and a yield variable represented by a pointer reading on a voltmeter. There is 
also a dial, to be set by the experimenter, that introduces statistical variation into 
the yields. An experimental run consists in a series of parameter settings by a mana- 
ger, serving as experimental subject, after each of which the experimenter reports 
the yield after introducing appropriate statistical variation. The objective of the manager
is to find a set of five parameter values that produces high yield, and he is to
discover the settings he prefers by making his sequence of trials in such a way that he
rapidly discovers a good set of parameter values.

Under the rules used by McArthur, the subject is charged roughly one unit for each 
yield observation in his sequence and he is credited with four units for each point 
gained in yield over 24. The manager is told, before the experimental run, that: 

1) The average yield is exactly 24 if each process dial is set at 50,

2) The statistical variation is represented by a normal distribution
with mean value 0, and standard deviation 2.5, for any particular
setting of the process dials,

3) There is a unique setting for the process dials that produces the
greatest possible average yield,

4) Each process dial may be set anywhere between 0 and 85 on its
scale, and the maximum possible average yield is less than 50, and

5) The functional relationship between yield and process parameters
is continuous, in the technical mathematical sense.

It can be seen immediately that no more than 100 observations can possibly be taken 
with profit, since 25 points is about the maximum possible gain in yield but worth only 
100 as contrasted with a charge of 100. These rules certainly seem to represent one 
kind of sequential decisioning problem met by process managers, and one where there 
is no positive assurance of eventual gain from even the wisest possible choice of ex- 
perimental program. 
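
The limit of about 100 observations can be written out as a short worked bound. The symbols n (number of observations taken) and y (true average yield at the setting finally chosen) are introduced here for illustration only; the payoff rule is the one stated above, four units per point of yield over 24 and one unit per observation.

```latex
\[
\text{net gain} \;=\; 4\,(y - 24) - n ,
\qquad y < 50
\;\Longrightarrow\;
\text{net gain} \;<\; 4\,(50 - 24) - n \;=\; 104 - n .
\]
```

The net gain can therefore be positive only if n is below roughly 100, which is the figure quoted in the text.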

The mathematical decisioning technique that might at first seem most suitable for 
this kind of problem is the Box-Wilson procedure [6]. In very rough terms, this
procedure requires that a few sets of parameter values be tried, and the resulting 
yields observed, where the trial sets are arranged in a systematic manner about some 
particular central set. A sort of topographical contour map is then made to fit these 
observed yields as well as possible, and the resulting contours suggest the direction 
in which to change the parameter values so as to take the steepest path toward the 
summit sought. This method can be visualized in three dimensions as an ordinary con- 
tour map in which the two ground-position coordinates correspond to two process para- 
meter values and a map elevation corresponds to the average process yield resulting. 
Unfortunately, when there are even as many as five process parameters, the topographical
complexities possible are so great that the Box-Wilson procedure may very
easily require a great number of successive recalculations before the summit is
neared, and there is also no way of knowing whether or not the actual summit is
very much higher than the best yield observed at any particular stage of the calculation.
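
As a rough illustration of the idea described above, the following Python sketch carries out one first-order "steepest ascent" step for two parameters. It is a simplified stand-in and not the Box-Wilson procedure as published: the 2 x 2 layout of trial points, the function name, and the arguments half_width and step_length are assumptions made here for illustration.

```python
import math


def steepest_ascent_step(yield_fn, center, half_width, step_length):
    """Fit a plane to four trial yields placed around `center` and return a
    new center moved `step_length` along the estimated direction of ascent."""
    cu, cv = center
    corners = [(su, sv) for su in (-1, 1) for sv in (-1, 1)]
    yields = {
        (su, sv): yield_fn(cu + su * half_width, cv + sv * half_width)
        for su, sv in corners
    }
    # For a plane y = a + b*u + c*v these contrasts recover b and c exactly.
    slope_u = sum(su * y for (su, _), y in yields.items()) / (4 * half_width)
    slope_v = sum(sv * y for (_, sv), y in yields.items()) / (4 * half_width)
    norm = math.hypot(slope_u, slope_v)
    if norm == 0:
        return center  # flat as far as these four trials can tell
    return (cu + step_length * slope_u / norm, cv + step_length * slope_v / norm)
```

Each step costs four yield observations, which is why repeated recalculation becomes expensive when every observation is charged for.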
A simple single-parameter example may help to illustrate some of the difficulties
encountered in problems like that posed by McArthur for the Alcohol Plant. Suppose
that the average yield y and a parameter x are connected functionally by the relation

$$y = \frac{x^5}{5} - \frac{10x^4}{4} + \frac{35x^3}{3} - \frac{50x^2}{2} + 24x$$

A problem of the kind we are considering requires finding the value of x between 
zero and five that yields the largest value for y. This would be a very simple problem 
in the differential calculus if we in fact knew the algebraic relationship between x and 
y; however, in our problem, we can find the yield y corresponding to a given parameter
value x only by conducting an experiment that provides an estimate of y that is
subject to some error. For example, if the value x = 1 were tried several times we
would observe values for y clustered about y = 8.3667; the spread of the cluster about
this true average value would depend upon the amount of random variation in the process.

The graph of our function is shown roughly in Figure 1. 



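[Figure 1: graph of the average yield y against the parameter x, for x from 0 to 5; the y axis runs to 16, and the value 15.83 is marked.]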

Fig. 1. 

The best average yield is obtained when x = 5. If one used the Box-Wilson procedure,
with the erroneous conviction that there was a unique "peak," then the parameter value
x = 1 would very likely be selected eventually if the initial central point was a value
of x less than two, and the summit at x = 5 would be selected only if the initial central
point was a value of x not less than four. Of course, an expert with the Box-Wilson
procedure takes precautions against making erroneous assumptions, but there is no
certain way to avoid all of these dangers, and the chances for making such errors
increase rapidly as the number of parameters grows larger.
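
The dependence on the starting point can be seen in a small Python sketch. It assumes that the relation between y and x is y = x^5/5 - 10x^4/4 + 35x^3/3 - 50x^2/2 + 24x, which reproduces the values 8.3667 at x = 1 and 15.83 at x = 5 quoted in the text. The hill-climbing rule, the grid of 0.01, and the function names are illustrative assumptions, and the yields are taken without error.

```python
def average_yield(x):
    """Assumed error-free relation between the parameter x and the yield y."""
    return x**5 / 5 - 10 * x**4 / 4 + 35 * x**3 / 3 - 50 * x**2 / 2 + 24 * x


def climb(start_index, steps_per_unit=100, upper=5):
    """Walk uphill on the grid x = index / steps_per_unit, 0 <= x <= upper,
    and return the x at which neither neighbor gives a higher yield."""
    last = upper * steps_per_unit
    index = start_index

    def yield_at(k):
        return average_yield(k / steps_per_unit)

    while True:
        neighbors = [k for k in (index - 1, index + 1) if 0 <= k <= last]
        best = max(neighbors, key=yield_at)
        if yield_at(best) <= yield_at(index):
            return index / steps_per_unit
        index = best


if __name__ == "__main__":
    for start in (50, 150, 250, 450):   # x = 0.5, 1.5, 2.5, 4.5
        print(start / 100, "->", climb(start))
```

Starts below x = 2 end at the minor peak at x = 1, starts between 2 and 4 end at x = 3, and only starts above 4 reach the true summit at x = 5.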

Another example, this time one with two parameters and superimposed error in 
yield, will help to show the characteristics of the more general sequential search problem,
as met in the Alcohol Plant, for example. We will call the parameters U and
V, and the yield F, with the following conditions: 

1) The parameters U and V are expressed in percentage units; 

2) It is known that there is a unique pair of values for U and V 
that produces the maximum average yield F, and that this maxi- 
mum yield is expressed in percentage units; 

3) The yield F is a continuous and finite function of the two 
parameters; 

4) The cost of one observation on F is one-fourth the value of a 
percentage point in yield. 

Our problem, exactly similar to that for the Alcohol Plant, is to maximize net value, 
where net value is the difference between four times the true average yield correspon- 
ding to the pair of parameter values finally selected as a result of the experimental 
analysis and the number of trial observations taken. 
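
The net-value rule just stated is simple enough to write down directly. The Python sketch below uses function names chosen here for illustration; the factor of four and the unit cost per observation come from condition 4 above.

```python
def net_value(true_average_yield, observations, value_per_point=4):
    """Net value: worth of the yield finally selected minus the cost of the
    trial observations, one unit each."""
    return value_per_point * true_average_yield - observations


def break_even_yield(current_yield, extra_observations, value_per_point=4):
    """Yield that must be exceeded for `extra_observations` more trials to
    leave the net value higher than stopping now at `current_yield`."""
    return current_yield + extra_observations / value_per_point
```

For instance, a true average yield of 75 found after 25 observations gives a net value of 4 x 75 - 25 = 275, and 25 further observations pay for themselves only if they raise the selected yield above 81.25.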

Since there is no especially good way in which to solve this kind of problem, and 
since our main interest now is in gaining a feeling for such problems, we shall proceed
in a crude, common-sense sort of way while making some use of "steepest
ascent" techniques like those of Box-Wilson. We gamble with a start on 25 points, and
obtain the following yield data:



Yields Without Error
(from Figure 2.)

From this initial sample it would appear that the parameter pair (100, 0), with an observed
yield of 75, might be in a region worth further exploration. But before we do
go on it will be worthwhile to examine our present situation a bit further.

[Figure 2: yields without error over the (U, V) parameter plane.]

Fig. 2

If there were no error in the yield measurements, then we could stop and be assured
of a net value of (4 x 75 - 25 = 275). Furthermore, since maximum yield is certainly
not greater than 100, we could not take as many as 100 more observations without certain
loss; also, for example, if 25 more observations were taken the yield obtained
would have to be above 81 for a profit. But there is error in the yield measurements,
and of an amount as yet unknown, so there is no assurance that 75 is the actual average
yield. Again, on a common sense basis, we take five observations near (100, 0)
and obtain the following yield data:

Yields Without Error    Yields With Error

Our three independent measurements at (100, 0) are 75, 69, 70; their mean is 72 and
standard deviation is 3.3, but these estimates are quite unreliable since they are
based on only three observations. Common sense, perhaps aided by experience with
contour maps in this case, suggests that pairs summing to 100, with U at least 90,
may be relatively high in the region now under exploration; yields from five of these
follow.

Logical contouring of all our data would suggest a rather high "ridge" represented
by the parameter pairs where U + V = 100 and U > 90, but now the pair (92, 8) has given
a yield of 69, comparable with that averaging 71.1 for the four observations on the
pair (100, 0). Our single observation at (92, 8) has given us a much less reliable estimate
than our four observations at (100, 0), the latter with an estimated mean of 71.1
and estimated standard deviation of 2.6. Clearly, we need some more methodical way
in which to make our final choice even if we gamble on the best possible choice being
in the general region now under exploration, and there is absolutely no assurance that
distinctly better yields are not available near (25, 25) or at some other pair not yet
tried.
The Box-Wilson procedure is indeed of some help once we commit ourselves to the
assumption that there is a unique "peak" within the region under exploration, in this
case where 90 <= U <= 100 and 0 <= V <= 10. This is accomplished by fitting a second
degree polynomial in the variables U and V to the available data, by the method of least
squares, and then calculating mathematically the pair of values for U and V that gives
the greatest value for this polynomial over the allowable ranges for U and V. We shall
not here discuss the many simplifying techniques for carrying out such a polynomial
fitting computation, and we shall also not discuss the very important matter of proper
selection of the pairs to be used in obtaining observations on yields; we shall be concerned
only with the general concepts.

In principle then, we need to find values for the coefficients of a polynomial 

$$P(U, V) = AU^2 + BUV + CV^2 + DU + EV + F$$

that passes closest, in the least squares sense, to our observed yields. Since we have 
observed on only 11 pairs and need to estimate six coefficients, our fitted polynomial 
is not too firmly determined; in practice we would also not repeat observations for 
one pair, as we have done here four times at (100, 0). Although this calculation has 
not been made for this case, it seems likely that it would in fact give as answer a pair 
close to (100, 0). It also happens that this problem, as set up, has the highest peak at 
(100, 0) where the average yield is 72. 
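
Since the calculation was not carried out in the text, the following Python sketch only shows what such a least-squares fit involves. The function names, the use of the normal equations with Gaussian elimination, and the shifted coordinates u = U - 95, v = V - 5 suggested in the usage note are all assumptions made here for illustration; no yield data from the example are reproduced.

```python
def design_row(u, v):
    """Terms of the second-degree polynomial A u^2 + B uv + C v^2 + D u + E v + F."""
    return [u * u, u * v, v * v, u, v, 1.0]


def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_quadratic(points, yields):
    """Least-squares coefficients [A, B, C, D, E, F] via the normal equations."""
    rows = [design_row(u, v) for u, v in points]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    moment = [sum(r[i] * y for r, y in zip(rows, yields)) for i in range(6)]
    return solve_linear(normal, moment)


def stationary_point(coeffs):
    """Point where both partial derivatives of the fitted polynomial vanish."""
    a, b, c, d, e, _ = coeffs
    return tuple(solve_linear([[2 * a, b], [b, 2 * c]], [-d, -e]))
```

In use, the trial pairs would first be shifted to small centered values (for example u = U - 95 and v = V - 5) to keep the arithmetic well conditioned. The stationary point is the candidate optimum only if it is a maximum and lies inside the allowed ranges; otherwise the boundary has to be examined separately.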

In the case of the Alcohol Plant, which has five parameters, it could not possibly be
profitable to start by trying even three values for each parameter, since this would be
a total of 3^5 = 243 combinations and at most 100 observations can possibly be taken with
profit. Even trying all the extreme combinations, with two values per parameter, would require
2^5 = 32 combinations.
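
The counts in this paragraph can be checked by enumerating the settings directly. The short Python sketch below uses a function name chosen here for illustration; with five parameters it gives 3^5 = 243 settings for three values per parameter and 2^5 = 32 for two.

```python
from itertools import product


def count_settings(values_per_parameter, parameters=5):
    """Number of distinct settings when every parameter takes the same
    number of trial values and all combinations are tried."""
    return sum(1 for _ in product(range(values_per_parameter), repeat=parameters))
```

Either count is to be compared with the ceiling of about 100 observations that can be taken with profit.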

Our examples have shown how very formidable are problems of this kind, even with 
a few parameters; yet managers and engineers are constantly making decisions of 
just this kind, and apparently with considerable skill.

Sequential Decisioning Formulas.

There are several recent papers [11] on "stochastic approximation" that give very
interesting mathematical procedures for solving the kind of sequential decisioning problems
we have been considering. Unfortunately, each of these procedures is quite
limited in its scope of application. One of them, the Kiefer-Wolfowitz procedure [8],
will be discussed briefly here as an example of this general approach. We shall here
consider only the single parameter case, and for that only one of an infinite number of
the applicable Kiefer-Wolfowitz procedures.

Our example is similar to the one used earlier, where yield Y and parameter X are 
connected by the broken line relation shown in Figure 3.

[Figure 3: broken-line graph of F(X) against X; the value 16.0 is marked.]

Fig. 3

Now, however, we shall superimpose an error in such a way that the observed value
of Y will be Y + E, where E is normally distributed with mean zero and standard deviation
unity.

In application of the Kiefer-Wolfowitz procedure to this problem:

1) Choose a value $X_1$ arbitrarily, such that $0 < X_1 < 5$;

2) Take yield observations for the parameter values $(X_1 + 1)$ and $(X_1 - 1)$,
and call these $G_1$ and $H_1$;

3) Calculate a value $X_2 = X_1 + (G_1 - H_1)$;

4) Take yield observations for two parameter values placed symmetrically about $X_2$
[offset illegible], and call these $G_2$ and $H_2$;

5) Calculate a value $X_3$ from $X_2$ and the difference $(G_2 - H_2)$ [coefficient illegible];

6) Continue this process, getting observations $G_n$ and $H_n$ for each successive
value $X_n$ and calculating $X_{n+1}$ from them.

Since we have limited $X_n$ to the closed interval [0, 5], we shall arbitrarily change
any negative argument in F(X) to zero, for the next step, and limit positive arguments
to five should a larger value ever be given by the formula. A sample computation for
our example is shown in Table 1.

Although this sample calculation proves nothing, it does illustrate the manner in 
which the Kiefer-Wolfowitz procedure tends eventually to select the value X = 5, and
then tends to stay there. It also illustrates the lack of tendency to stay at the second
highest peak, located at X = 3 in this case.
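
A procedure of this kind can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions and is not the particular procedure used for the sample computation: the step coefficients a_n = 1/n and the offsets c_n = n^(-1/3) are one common choice assumed here (the first offset, c_1 = 1, agrees with step 2 above), and the function and argument names are hypothetical. Arguments outside the interval from 0 to 5 are moved to the nearest end of the interval, as described in the text.

```python
import random


def clip(x, lo=0.0, hi=5.0):
    """Keep a parameter value inside the allowed interval."""
    return max(lo, min(hi, x))


def kiefer_wolfowitz(yield_fn, x1, n_steps, noise_sd=1.0, rng=None):
    """Return the sequence X_1, X_2, ... produced by the recursion
    X_{n+1} = X_n + a_n * (G_n - H_n) / c_n with assumed a_n and c_n."""
    rng = rng or random.Random()

    def observe(x):
        # True average yield plus a normally distributed error.
        return yield_fn(clip(x)) + rng.gauss(0.0, noise_sd)

    xs = [clip(x1)]
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n                # assumed step coefficient
        c_n = n ** (-1.0 / 3.0)      # assumed offset of the two trial points
        g = observe(xs[-1] + c_n)
        h = observe(xs[-1] - c_n)
        xs.append(clip(xs[-1] + a_n * (g - h) / c_n))
    return xs
```

Only the current value X_n is carried from one step to the next, which is the simplicity noted among the attractive features listed below.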

There are several ways in which currently available stochastic approximation pro- 
cedures, such as the one of Kiefer-Wolfowitz, are not satisfactory: 

1) They depend upon the important restriction against multiple peaks, 
and upon other less severe but essential restrictions concerning the 
nature of the functional relationships between yield and parameters, 
even for assurance that the sequential search would ultimately settle 
down upon the desired parameter values; 

2) They may settle down too slowly to be of use, and the settling rate
depends in a critical and unknown manner upon certain constants that
must be chosen to establish a specific procedure;

3) They include no "stop rule" that indicates when the calculation
should be terminated if there is a cost of some kind associated with
each additional step.

There are several features of this type of procedure that make them especially at- 
tractive: 

4) They are "adaptive, " in the sense that continued application over 
time after some unsuspected change has occurred in the underlying 
relationship between yield and parameters will automatically shift 
to the proper parameter values; 

5) They require the same simple calculation after each observation, 
and do not require the retention of old data for use in future calcula- 
tions; 

6) They are reasonably self-corrective with respect to computational 
errors, since such errors have the same kind of effects as do other 
errors comprising the underlying system. 

For all these reasons, the stochastic approximation procedures would seem to have 
their most promising applications in systems where successive observations are fre- 
quent, where the underlying functional relationship is changing occasionally and at un- 
suspected times, and where profits are heavily dependent upon making appropriate 
corrections regularly during the operation of the system.

Summary

There is under development a set of new mathematical decisioning procedures that 
show great promise for effective application in managing and controlling systems
where the selection of best current operating conditions is redetermined continuously
on the basis of past results. These procedures may be used effectively from the start 
of operation of a system, and their suitability in any particular instance depends only 
upon knowing that the system meets certain broad conditions imposed by a few rather 
unrestrictive mathematical conditions. The present paper offers a few illustrative ex- 
amples of these newer procedures, and makes some comparisons between them and 
certain other decisioning techniques based on the minimax approach of game theory.

LOWER BOUNDS FOR THE EXPECTED SAMPLE SIZE OF A SEQUENTIAL TEST

Wassily Hoeffding 

Summary. This expository paper is concerned with lower bounds for the expected
sample size $E_0(N)$ of an arbitrary sequential test whose error probabilities at two
parameter points $\theta_1$ and $\theta_2$ do not exceed given numbers $\alpha_1$ and $\alpha_2$, where $E_0(N)$ is
evaluated at a third parameter point $\theta_0$. The bounds in (1.3) and (1.4) are shown to be
attainable or nearly attainable in certain cases where $\theta_0$ lies between $\theta_1$ and $\theta_2$.
1. Introduction and main results. Let $X_1, X_2, \ldots$ be a sequence of independent random
variables having a common probability density f. (All results also apply to the case
where the distribution of $X_1$ is discrete and f(x) denotes the probability of $X_1 = x$; in
this case the integrals in the formulas are to be replaced by sums.) One of two decisions,
$d_1$ and $d_2$, is to be made. Let $f_1$ and $f_2$ be two probability densities such that
decision $d_2$ ($d_1$) is considered as wrong if $f = f_1$ ($f_2$). We shall consider sequential tests
(decision rules) for making decision $d_1$ or $d_2$, such that the probability of a wrong decision
does not exceed a positive number $\alpha_i$ when $f = f_i$ (i = 1, 2). Let N denote the
(random) number of observations required by such a test. This paper is concerned
with lower bounds for $E_0(N)$, the expected sample size when $f = f_0$, where $f_0$ is in general
different from $f_1$ and $f_2$.

The background of this problem is as follows. Suppose that f depends on a real parameter
$\theta$ and $f_i$ corresponds to the value $\theta_i$, where $\theta_1 < \theta_2$. Suppose further that
decision $d_1$ or $d_2$ is preferred according as $\theta \le \theta_1$ or $\theta \ge \theta_2$ and that neither decision
is strongly preferred if $\theta_1 < \theta < \theta_2$. If we require that the probability of a wrong decision
does not exceed $\alpha_1$ ($\alpha_2$) if $\theta \le \theta_1$ ($\theta \ge \theta_2$), the condition of the preceding paragraph
will be satisfied. For a number of the common one-parameter families of distributions
(such as the normal distributions with mean $\theta$ and known variance or with
variance $\theta$ and known mean, or binomial distributions with mean $\theta$) Wald's sequential
probability ratio (SPR) test for testing the hypothesis $\theta = \theta_1$ against the alternative
$\theta = \theta_2$ can be applied to this problem.

The SPR test can be defined as follows [8]. Let a and b be two constants such
that $b < 0 < a$. If $x_1, x_2, \ldots, x_n$ are the first n observations ($n \ge 1$) and $Z_n$ denotes the sum

$$Z_n = \log\frac{f_2(x_1)}{f_1(x_1)} + \cdots + \log\frac{f_2(x_n)}{f_1(x_n)},$$

sampling is continued as long as $b < Z_n < a$. Sampling is stopped as soon as one of
these inequalities is violated and, according as $Z_n \le b$ or $Z_n \ge a$, decision $d_1$ or $d_2$ is
made. It is known [10] that the SPR test for testing $\theta_1$ against $\theta_2$, with error probabilities
equal to $\alpha_1$ and $\alpha_2$, minimizes the expected sample size at these two parameter
values. In typical cases its expected sample size is largest when $\theta$ is between
$\theta_1$ and $\theta_2$ (that is, when neither decision is strongly preferred), and in general there
exist tests whose expected sample size at these intermediate values is smaller than
that of the SPR test. (A special case in which a SPR test minimizes the maximum expected
sample size will be discussed in section 2.)
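
The stopping rule of the SPR test is easy to state in code. The Python sketch below is illustrative only: the function names are introduced here, and the normal family with unit variance is chosen as an example, for which the logarithm of $f_2(x)/f_1(x)$ is linear in x.

```python
def normal_log_ratio(theta1, theta2):
    """log f2(x)/f1(x) for normal densities with unit variance and
    means theta1 (under f1) and theta2 (under f2)."""
    def log_ratio(x):
        return (theta2 - theta1) * x - (theta2 ** 2 - theta1 ** 2) / 2.0
    return log_ratio


def spr_test(observations, log_ratio, a, b):
    """Run the SPR test with boundaries b < 0 < a on a sequence of observations.

    Returns (decision, n): decision is 1 or 2 and n is the number of
    observations used, or (None, n) if the data ran out before a boundary
    was reached."""
    z = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        z += log_ratio(x)
        if z <= b:
            return 1, n
        if z >= a:
            return 2, n
    return None, n
```

With means -0.5 and 0.5 the increment of the sum is simply x, so the observations 1, 1, 1 with a = 2.5 and b = -2.5 lead to decision 2 at the third observation.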

In principle it is possible to construct a test which minimizes the expected sample
size at an arbitrary $\theta$ value or minimizes the maximum expected sample size. Kiefer
and Weiss [5] have proved important qualitative properties of such tests. They have
shown that for one-parameter families such as those mentioned above, a test which
minimizes the expected sample size at a value $\theta$ with $\theta_1 < \theta < \theta_2$ can be defined in
terms of two finite sequences of numbers, $a_1, a_2, \ldots, a_M$ and $b_1, b_2, \ldots, b_M$, such
that $a_n \ge a_{n+1}$, $b_n \le b_{n+1}$, $b_n < a_n$ for $n < M$ and $b_M = a_M$. Sampling is continued as
long as $b_n < Z_n < a_n$ and is stopped as soon as $Z_n \le b_n$ or $Z_n \ge a_n$, and decision $d_1$ ($d_2$)
is made in the first (second) case. Thus the test requires at most M observations.

The actual determination of the numbers $a_n$ and $b_n$ and the evaluation of the expected
sample size and the error probabilities meets with difficulties which have not been
overcome so far. Therefore attempts have been made to find a test which, without actually
minimizing the maximum expected sample size, comes close to this goal, or at
least substantially improves upon the performance of known tests. I mention in particular
the work of Donnelly [2] and Anderson [1] who, independently of each other,
considered tests like those just described, with $a_n = c_1 + d_1 n$ and $b_n = c_2 + d_2 n$, where
$c_1 > 0 > d_1$ and $c_2 < 0 < d_2$. (Anderson also considered truncated tests of this type.)
Thus if the successive values of $Z = Z_n$ (n = 1, 2, ...) are plotted in the (n, Z) plane,
the boundaries at which the SPR test stops are two parallel lines, the boundaries for a
Kiefer-Weiss test are monotone curves, the upper decreasing and the lower increasing,
and the boundaries for a Donnelly-Anderson test are converging straight lines.

The performance of these and other tests can, to some extent, be judged by comparing,
at any parameter point $\theta$, the expected sample size of the test with the smallest
expected sample size attainable by any test having the same error probabilities at
$\theta_1$ and $\theta_2$. In the ignorance of the minimum expected sample size, the comparison
may be made with a lower bound for this minimum. If the discrepancy is small, neither
the test (as judged by this criterion) nor the bound can be greatly improved. Our
main concern will be with bounds which are best when $\theta$ is between $\theta_1$ and $\theta_2$.

We admit arbitrary (in general, randomized) sequential tests which terminate with
probability one under each of $f_0$, $f_1$, and $f_2$. We also assume with no loss of generality
that $E_0(N) < \infty$. To exclude trivialities we suppose that $\alpha_1 + \alpha_2 < 1$.

The first lower bound for the expected sample size was given by Wald (see [8],
p. 197) who proved for the case $f_0 = f_1$ that

$$E_0(N) \ge \frac{\alpha_1 \log\dfrac{\alpha_1}{1-\alpha_2} + (1-\alpha_1)\log\dfrac{1-\alpha_1}{\alpha_2}}{\int f_1 \,[\log(f_1/f_2)]\,dx} \tag{1.1}$$

and an analogous inequality for $f_0 = f_2$. (Wald's proof assumes a non-randomized test,
but this restriction is easy to remove.) It can be shown that both the numerator and
the denominator in (1.1) are positive; the integral in the denominator can be equal to
$+\infty$, in which case the lower bound has the trivial value 0. The sign of equality in
(1.1) can be attained with a SPR test in the case where the ratio $f_1/f_2$ takes on the two
values C and 1/C only, provided that the values $\alpha_1$ and $\alpha_2$ can be achieved as error
probabilities in this test. In certain other cases the sign of equality can be nearly attained
with a SPR test.

An extension of (1.1) to the case of an arbitrary $f_0$ has been given by the author [3]:

$$E_0(N) \ge \sup_{0<c<1} \frac{-\log\left[\alpha_1^{\,c}(1-\alpha_2)^{1-c} + (1-\alpha_1)^{c}\,\alpha_2^{\,1-c}\right]}{c\int f_0 \log(f_0/f_1)\,dx + (1-c)\int f_0 \log(f_0/f_2)\,dx} \tag{1.2}$$

For $f_0 = f_1$ and $c \to 1$, (1.2) reduces to (1.1). This bound is likely to be close when
$f_0$ is close to $f_1$ or $f_2$.

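In this paper two new inequalities will be considered,

$$E_0(N) \ge \frac{1 - \alpha_1 - \alpha_2}{1 - \int \min(f_0, f_1, f_2)\,dx} \tag{1.3}$$

and

$$E_0(N) \ge \frac{\left\{\left[(\tau/4)^2 - \zeta \log(\alpha_1 + \alpha_2)\right]^{1/2} - \tau/4\right\}^2}{\zeta^2} \tag{1.4}$$

where

$$\zeta = \max(\zeta_1, \zeta_2), \qquad \zeta_i = \int f_0 \log\frac{f_0}{f_i}\,dx, \quad i = 1, 2, \tag{1.5}$$

and

$$\tau^2 = \int \left(\log\frac{f_2}{f_1} - \zeta_1 + \zeta_2\right)^2 f_0\,dx. \tag{1.6}$$

Note that $\zeta_i \ge 0$, with strict inequality applying whenever $f_0$ and $f_i$ are densities of different
distributions.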

In the proof of (1.4) it is assumed that, in addition to the existence of the integrals
in (1.5) and (1.6),

$$f_0(x) = 0 \ \text{ implies } \ \min[f_1(x), f_2(x)] = 0, \tag{1.7}$$

and that the equation

$$E_0\Big(\sum_{j=1}^{N} Y_j\Big)^2 = \tau^2 E_0(N) \tag{1.8}$$

is satisfied, where

$$Y_j = \log\frac{f_2(X_j)}{f_1(X_j)} - \zeta_1 + \zeta_2 .$$

Concerning the last assumption we note that $\zeta_1 - \zeta_2 = \int f_0 \log(f_2/f_1)\,dx$, so that
$E_0(Y_j) = 0$ and, by (1.6), $E_0(Y_j^2) = \tau^2$. Equation (1.8) has been proved by Wald [7]
and Wolfowitz [11] under certain conditions; see also Seitz and Winkelbauer [6].
It certainly holds if N is bounded or if $Y_1 + \cdots + Y_m$ is bounded for $m < N$. It is clear
that if condition (1.8) is satisfied for a test which minimizes $E_0(N)$, then inequality
(1.4) is true also for any other test. In particular this is true under the assumptions
of Theorem 4 of Kiefer and Weiss [5], which imply that if a test minimizes $E_0(N)$,
then N is bounded.

Inequalities (1.3) and (1.4) will be discussed in the following sections. Proofs of
the inequalities are given in [4].

2. Discussion of inequality (1.3). The sign of equality in (1.3) can be attained in two
cases, in both of which

$$f_0(x) \ge \min[f_1(x), f_2(x)]. \tag{2.1}$$

(This condition is satisfied for many common one-parameter families of distributions
when $\theta_0$ is between $\theta_1$ and $\theta_2$.) Under condition (2.1) inequality (1.3) can be written as

$$E_0(N) \ge \frac{1 - \alpha_1 - \alpha_2}{1 - \int \min(f_1, f_2)\,dx} = \frac{2\,(1 - \alpha_1 - \alpha_2)}{\int |f_1 - f_2|\,dx}. \tag{2.2}$$

The last equation is obtained by integrating on both sides of the identity

$$\min(f_1, f_2) = \tfrac{1}{2}(f_1 + f_2) - \tfrac{1}{2}\,|f_1 - f_2| .$$

In the first case where equality in (2.2) can be attained, the densities $f_i$ are arbitrary,
subject only to (2.1), but the values $\alpha_1$ and $\alpha_2$ are severely restricted by the
condition that they be attainable as error probabilities by a test which uses at most
one observation, $x_1$, and decision $d_1$ ($d_2$) is made if $f_1(x_1) - f_2(x_1)$ is positive (negative).
("At most" means that we may decide at random, with prescribed probabilities, between
taking no or one observation, and in the former case choose at random $d_1$ or
$d_2$.)

The second case in which equality can be attained in (2.2) is where, in addition to
(2.1), the three densities $f_0(x)$, $f_1(x)$, and $f_2(x)$ are equal to each other on a set of
points having a positive probability. In particular, let $f_i(x)$ be the rectangular density
which is equal to 1/L if $\theta_i - L/2 < x < \theta_i + L/2$ and zero elsewhere, and let
$0 < \theta_2 - \theta_1 < L$, $\theta_1 < \theta_0 < \theta_2$. Choose two numbers c and d such that
$\theta_2 - L/2 \le c \le d \le \theta_1 + L/2$, and consider the following test. Sampling is continued as long as the observations
$x_1, x_2, \ldots$ fall into the interval (c, d). Sampling is stopped as soon as
$x_n \le c$ or $x_n \ge d$, and decision $d_1$ ($d_2$) is made if $x_n \le c$ ($x_n \ge d$). It can be calculated
that in this case the error probabilities are

$$\alpha_1 = \frac{\theta_1 - d + (L/2)}{L - d + c}, \qquad \alpha_2 = \frac{c - \theta_2 + (L/2)}{L - d + c}.$$

Also

$$1 - \int \min(f_1, f_2)\,dx = \frac{\theta_2 - \theta_1}{L}$$

and

$$E_0(N) = \frac{L}{L - d + c}.$$

Hence the two sides of inequality (2.2) are equal. In this example equality in (2.2) can
be achieved for any values $\alpha_1$ and $\alpha_2$ such that $\alpha_1 > 0$, $\alpha_2 > 0$, and $\alpha_1 + \alpha_2 < 1$,
either by a suitable choice of c and d in the test just described, or by a test using at
most one observation. Moreover, the expected sample size of the test here considered,
assuming that the distribution of $x_i$ is rectangular on an interval of length L
with an arbitrary mean $\theta$, can be shown to attain its maximum when $\theta_1 \le \theta \le \theta_2$.
Hence the test minimizes the maximum expected sample size.
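
The equality claimed for the rectangular example can be checked with a few lines of arithmetic. The Python sketch below assumes that inequality (2.2) reads $E_0(N) \ge 2(1 - \alpha_1 - \alpha_2) / \int |f_1 - f_2|\,dx$ and uses the fact that, for two rectangular densities of length L whose means differ by less than L, the integral of $|f_1 - f_2|$ equals $2(\theta_2 - \theta_1)/L$. The function name and argument names are chosen here for illustration.

```python
def rectangular_example(theta1, theta2, length, c, d):
    """Error probabilities, expected sample size, and the lower bound (2.2)
    for the test that continues sampling while c < x < d."""
    if not 0 < theta2 - theta1 < length:
        raise ValueError("need 0 < theta2 - theta1 < length")
    if not theta2 - length / 2 <= c <= d <= theta1 + length / 2:
        raise ValueError("need theta2 - L/2 <= c <= d <= theta1 + L/2")
    denom = length - d + c
    alpha1 = (theta1 - d + length / 2) / denom   # wrong decision under f1
    alpha2 = (c - theta2 + length / 2) / denom   # wrong decision under f2
    expected_n = length / denom                  # mean of a geometric stopping time
    l1_distance = 2 * (theta2 - theta1) / length # integral of |f1 - f2|
    bound = 2 * (1 - alpha1 - alpha2) / l1_distance
    return alpha1, alpha2, expected_n, bound
```

With $\theta_1 = 0$, $\theta_2 = 1$, L = 4, c = -0.5 and d = 1.5 both error probabilities are 0.25 and the expected sample size and the bound are both equal to 2.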

The present test is a modified version of Wald's SPR test. To see this, let $f_{in}$
stand for $f_i(x_1) f_i(x_2) \cdots f_i(x_n)$, and let $0 < B \le A$. The SPR test for testing $f_1$ against
$f_2$ is defined in [9] as follows. (The definition differs slightly from that in [8].)
Sampling is continued as long as $B < f_{2n}/f_{1n} < A$. If one of these inequalities is violated,
one proceeds as follows. If $f_{2n}/f_{1n} < B$, hypothesis $f_1$ is accepted. If
$f_{2n}/f_{1n} > A$, hypothesis $f_2$ is accepted. In the case $B < A$, if $f_{2n}/f_{1n} = A$ or $B$, a randomized
decision is made between taking another observation and accepting the appropriate
hypothesis. In the case $B = A$, if $f_{2n}/f_{1n} = A$, a randomized decision is made
between the three possibilities of taking another observation, accepting $f_1$, and accepting
$f_2$. In our example the ratio $f_{2n}/f_{1n}$ takes on the values 1, 0, and $\infty$ (except
that the ratio is not defined if $f_{1n} = f_{2n} = 0$), and the test of the preceding paragraph
is essentially the SPR test with A = B = 1, except that randomized decisions are replaced
by non-randomized ones.

It is of interest to note that the bound in (1.3) is always positive whereas the bounds in (1.1) and (1.2) take on the trivial value 0 if the integrals in their denominators are equal to $+\infty$. However, in most of the common cases the bounds (1.1) and (1.2) (as well as (1.4)) are better than (1.3). For instance, if $f_0$, $f_1$, and $f_2$ are normal distributions with a common variance and respective means $0$, $-\delta$ and $\delta$, the bound in (1.3) is of the order $\delta^{-1}$, but those in (1.2) and (1.4) are proportional to $\delta^{-2}$ and hence better than (1.3) if $\delta$ is small.

3. Discussion of inequality (1.4). Strict equality in (1.4) cannot be achieved except in trivial cases. To obtain an idea of how close the bound in (1.4) can come to the minimum attainable value of $E_0(N)$, we shall consider the following special case. Let $f_i$ be the normal probability density with variance 1 and mean $\theta_i$, where $\theta_0 = 0$, $\theta_1 = -\delta$ and $\theta_2 = \delta > 0$. Then $\zeta_1 = \zeta_2 = \delta^2/2$, $\tau = 2\delta$, and inequality (1.4) becomes

(3.1)    $E_0(N) \ge \delta^{-2}\{[1 - 2\log(2\alpha)]^{1/2} - 1\}^2$,

where $2\alpha = \alpha_1 + \alpha_2$. This bound will be compared with the values of $E_0(N)$ for a fixed sample size test, Wald's SPR test, and a test considered by Anderson, with error probabilities $\alpha_1 = \alpha_2 = \alpha$ ($< \tfrac{1}{2}$) in each case.

Let $S_n = x_1 + \cdots + x_n$. For a fixed sample size test such that decision $d_1$ or $d_2$ is made according as $S_n < 0$ or $S_n > 0$, the error probabilities at $-\delta$ and $\delta$ are both equal to $\Phi(-\delta n^{1/2})$, where

$$\Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

Hence $E_0(N)$ is the least $n$ such that $\Phi(-\delta n^{1/2}) \le \alpha$. If $\lambda = \lambda(\alpha)$ is defined by $\Phi(-\lambda) = \alpha$ we have

(3.2)    $E_0(N) = \delta^{-2}\lambda^2$,

exactly or with a good approximation. If $\alpha \to 0$, then $\lambda \to \infty$ and

$$\alpha = (2\pi)^{-1/2}\lambda^{-1}e^{-\lambda^2/2}\,[1 + O(\lambda^{-2})].$$

(Here $O(\lambda^{-2})$, order of $\lambda^{-2}$, denotes a term such that $\lambda^2 O(\lambda^{-2})$ is bounded as $\lambda \to \infty$. The $O$ terms in the following equations have an analogous meaning.) Hence

$$\lambda^2 = -2\log\alpha + O[\log(-2\log\alpha)].$$

The factor of $\delta^{-2}$ in inequality (3.1) is

$$\{[1 - 2\log(2\alpha)]^{1/2} - 1\}^2 = -2\log\alpha + O[(-2\log\alpha)^{1/2}].$$

Thus if $\alpha$ is small enough, the bound in (3.1) is nearly attained with a fixed sample size test, although the asymptotic approach is extremely slow. It follows that the fixed sample size test nearly minimizes the expected sample size at $\theta = 0$ when $\alpha$ is (very) small.

Now consider the SPR test which stops as soon as $2\delta|S_n| \ge \log A$ ($> 0$). Then $(\log A)^2 \le 4\delta^2 E_0(S_N^2) = 4\delta^2 E_0(N)$ by (1.8), and $A \le (1-\alpha)/\alpha$. These inequalities are close approximations for $\alpha$ fixed and $\delta$ small enough (Wald [8]). With this approximation,

(3.3)    $E_0(N) = \delta^{-2}\left\{\tfrac{1}{2}\log\dfrac{1-\alpha}{\alpha}\right\}^2$.

Put $\alpha = (1-\varepsilon)/2$; then

$$E_0(N) = \delta^{-2}\left\{\tfrac{1}{2}\log\frac{1+\varepsilon}{1-\varepsilon}\right\}^2 = \delta^{-2}[\varepsilon^2 + O(\varepsilon^4)]$$

and

$$\{[1 - 2\log(2\alpha)]^{1/2} - 1\}^2 = \{[1 - 2\log(1-\varepsilon)]^{1/2} - 1\}^2 = \varepsilon^2 + O(\varepsilon^4).$$

Thus if $\alpha$ is close to its upper bound $\tfrac{1}{2}$ and $\delta$ is small enough, the lower bound in (3.1) is nearly attained with a SPR test. Hence the SPR test nearly minimizes $E_0(N)$ in this case. Table 1 shows that even for $\alpha = 0.2$ the expected sample size exceeds the lower bound by only 3%. (The lower bound in (1.2) with $c = \tfrac{1}{2}$ also approaches $E_0(N)$ for the SPR test as $\alpha \to \tfrac{1}{2}$. However, inequality (3.1) is better than (1.2), as applied to the present case, for all values of $\alpha$.)

For $\alpha$ values not close to 0 or $\tfrac{1}{2}$ we compare the bound in (3.1) with the expected sample size of a test considered by Anderson [1]. This test stops as soon as $|S_n| \ge c + dn$, where $d < 0 < c$. Anderson approximated the sequence $\{S_n\}$ by a Wiener process so that his values for the expected stopping time, $E_0(\tau)$, when the mean of the process is 0 are approximations to $E_0(N)$. He chose the constants $c$ and $d$ so as to minimize $E_0(\tau)$ subject to prescribed error probabilities $\alpha_1 = \alpha_2 = \alpha$ at $\theta = \pm\delta$, for $\delta = 0.1$ and $\alpha = 0.01$ and $0.05$. Anderson's values are given in Table 1. The expected sample sizes exceed the lower bounds by only 3.6% and 2.8%, respectively. This shows that both Anderson's test (as judged by the expected sample size at $\theta = 0$) and inequality (3.1) cannot be greatly improved in these cases.

Table 1
Values of $E_0(N)$ for $\delta = 0.1$ and $\alpha_1 = \alpha_2 = \alpha$

alpha                 0.0001   0.001    0.01     0.05     0.1      0.2     0.3

Fixed sample size     1383     955      541.2    270.6    164.3    70.8    27.5
SPR test              2121     1193     527.9    216.7    120.7    48.0    17.9
Anderson's test       -        -        402.2    192.2    -        -       -
Lower bound (3.1)     1054     710      388.3    187.0    111.1    46.6    17.8
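The three analytic rows of Table 1 can be recomputed from formulas (3.1), (3.2) and (3.3). The Python sketch below does so using only the standard library; the function names are illustrative assumptions, and the values it prints should agree with the tabulated entries up to rounding in the last digit. Anderson's row is not reproduced, since it rests on his own numerical optimization.

```python
from math import log, sqrt
from statistics import NormalDist

def fixed_sample_size(alpha, delta):
    """Approximation (3.2): delta^-2 * lambda^2, where Phi(-lambda) = alpha."""
    lam = -NormalDist().inv_cdf(alpha)
    return (lam / delta) ** 2

def spr_size(alpha, delta):
    """Approximation (3.3) for the symmetric SPR test at theta = 0."""
    return (log((1 - alpha) / alpha) / (2 * delta)) ** 2

def lower_bound(alpha, delta):
    """Right-hand side of inequality (3.1) with alpha_1 = alpha_2 = alpha."""
    return (sqrt(1 - 2 * log(2 * alpha)) - 1) ** 2 / delta ** 2

for alpha in (0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3):
    print(alpha,
          round(fixed_sample_size(alpha, 0.1), 1),
          round(spr_size(alpha, 0.1), 1),
          round(lower_bound(alpha, 0.1), 1))
```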



It is shown in [4] that for each of the two sequential tests here considered the expected sample size attains its maximum when the mean of the normal distribution is 0. In conjunction with the preceding results this implies that each of these tests (as well as the fixed sample size test) comes close to minimizing the maximum expected sample size for certain $\alpha$ values.

To summarize, we have seen that in certain cases the lower bounds for the expected sample size of a sequential test which are given by (1.3) and (1.4) come close to the smallest attainable expected sample size. We also have seen which tests come close to minimizing the expected sample size at certain parameter points. The bound in (1.3) can be strictly achieved for some special distributions which, however, are rare in applications. The bound in (1.4) is closely approached by Anderson's test for the usual values of $\alpha$ like 0.05 and 0.01 in the example which we have considered. (In Anderson's paper [1] it is shown that the expected sample size of his test when $\theta = -\delta$ or $\theta = \delta$ (in our notation) does not considerably exceed the smallest attainable expected sample size, that is, the test does only slightly worse than the SPR test at these parameter points.) Although in this section we have discussed only the special case of a normal distribution with mean 0, similar results undoubtedly can be obtained for many other common types of distributions when $\theta_0$ is roughly midway between $\theta_1$ and $\theta_2$, and $\alpha_1$ and $\alpha_2$ are approximately equal.

ON SOME ASPECTS OF MODELS OF COMPLEX BEHAVIORAL SYSTEMS ¹
David Rosenblatt



1. Introduction. 

In this paper we propose to treat some formal and pragmatic aspects [13] of certain 
models of complex behavioral systems. These models relate to generalized resource 
flows and entail stochastic process representations of system activity. 

For present purposes, the term 'model of complex behavioral system' is essentially intended to convey the following set of notions. First, we mean a formulation of the properties of an abstraction called an 'entity' relative to a discrete index set, the latter called conventional system time. Second, an 'entity' is in general taken to exhibit some distinguished integral properties which may be functionally stated in terms of the properties of its proper parts, but which no proper 'entity part' may manifest. Third, an 'entity' is regarded as a construction of certain distinguished proper parts called 'sub-entities' in accordance with well-defined sets of rules of composition. In the systems of present interest, the parts called 'sub-entities' are taken to conditionally exhibit behavioral properties governed by specified finite-dimensional stochastic processes.

The particular models we propose to treat may be viewed as examples of complex behavioral systems drawn from two domains: the domain of statistical economics and the domain of information logistics. In effect, we consider certain provisional frameworks for the description of large-scale mass 'distributive' or 'flow' phenomena. These phenomena are construed as the conjoint outcome of the 'decision-making' activities of resource-connected 'entities' in time. The aspects which we take to be of special interest in this paper relate to the abstract concepts of balance, closure, and interaction.

¹ This paper was prepared as a part of the project "Symbolic Methods in the Study of Organizations" under Contract Nonr-1180(00) with the Office of Naval Research and with further support under Contract Nonr-761(05) of Project NR-047-001. Some of the results given here were presented in a preliminary version in a paper entitled "On Stochastic Process Representations of Economic and Accounting Activity," read before Section K of the American Association for the Advancement of Science, December 30, 1958, in Washington, D.C.

2. Relation Theoretic and Graph Theoretic Considerations.

The theory of finite homogeneous binary (or dyadic) relations developed by C. S. Peirce and E. Schröder [14, 20] may be taken to inform investigations of complex systems. A homogeneous binary or dyadic relation on a set $\sigma$ of $n$ elements $\alpha_1, \alpha_2, \ldots, \alpha_n$ is construed as any rule $\rho$ which specifies for each ordered couple $(\alpha_i, \alpha_j)$ of elements of $\sigma$ that either the relation $\rho$ obtains between $\alpha_i$ and $\alpha_j$ (symbolically $\alpha_i\,\rho\,\alpha_j$) or that it does not obtain (symbolically $\alpha_i\,\bar{\rho}\,\alpha_j$) for $i, j = 1, \ldots, n$. It is well established that homogeneous binary relations defined on finite sets of elements can be represented in 1-1 fashion by means of two equivalent formalisms: (i) by finite Boolean relation matrices of zeros and ones [1, 14]; and (ii) by finite directed graphs [11, 14].

The 1-1 representation of binary relations $\rho$ on $\sigma$ by square Boolean relation matrices $R = \|r_{ij}\|$ ($i, j = 1, \ldots, n$) is defined by

$$r_{ij} = \begin{cases} 1 & \text{if } \alpha_i\,\rho\,\alpha_j, \\ 0 & \text{if } \alpha_i\,\bar{\rho}\,\alpha_j. \end{cases}$$

The null relation $\Lambda$ then corresponds to the null matrix $\Lambda = \|r_{ij}\|$, $r_{ij} = 0$ for all $i, j$; the universal relation $V$ to the universal matrix $V = \|r_{ij}\|$, $r_{ij} = 1$ for all $i, j$ [26]. The identity relation $I$ corresponds to the identity matrix $I = \|r_{ij}\|$, $r_{ij} = 1$ if $i = j$ and $r_{ij} = 0$ if $i \ne j$ for $i, j = 1, \ldots, n$. The relation-algebraic operations of negation, conversion, union, intersection, relative addition and relative multiplication may then be given relation matrix representation by means of the classical formalism of Boolean algebra. Thus, Boolean relation matrix multiplication corresponds to the operation of relative multiplication of relation algebra [1, 26].

The graph theoretic representation of binary relations $\rho$ on $\sigma$ may be conveniently stated in terms of the 1-1 representation of finite-dimensional Boolean relation matrices by finite directed graphs. Given any square Boolean relation matrix $R = \|r_{ij}\|$ ($i, j = 1, \ldots, n$), the graph of $R$, $G(R)$, consists of $n$ objects $\alpha_1, \ldots, \alpha_n$ called vertices or points and the totality of ordered pairs of vertices $\alpha_i\alpha_j$ such that $\alpha_i\alpha_j$ exists if and only if $r_{ij} = 1$ in $R = \|r_{ij}\|$. The ordered pair (or edge or directed line) $\alpha_i\alpha_j$ is represented by an arrowed line directed from $\alpha_i$ to $\alpha_j$, with arrowhead pointing toward $\alpha_j$; edges of the form $\alpha_i\alpha_i$ are taken to be admissible for any vertex $\alpha_i$ of $G(R)$. A subgraph of graph $G$ is a subset of the edges and vertices of $G$ containing with each edge its terminal vertices or end points. With this 1-1 representation, it is then possible to designate the Boolean relation matrix $R(G)$ corresponding to any given finite directed graph $G$. In this paper, we employ "graph" for "finite directed graph". Any given subgraph $H$ of graph $G$ ($H \subset G$) may then be represented by the submatrix $R(H)$ (in the general sense of subrelation) of the Boolean relation matrix $R(G)$ corresponding to $G$.

The Boolean matrix representation and, equivalently, the graph theoretic representation of finite nonnegative square matrices is of interest in the present study. Clearly any finite nonnegative square matrix may be regarded as a system of (nonnegative) 'valuations' imposed upon or assigned to a homogeneous binary relation defined on a finite set. Let $A = \|a_{ij}\|$ denote a nonnegative square matrix of order $n$. The square Boolean relation matrix $R_A = \|r_{ij}\|$ of order $n$ is then defined by

$$r_{ij} = \begin{cases} 1 & \text{if } a_{ij} > 0, \\ 0 & \text{if } a_{ij} = 0, \end{cases} \qquad (i, j = 1, \ldots, n).$$

For all finite matrix powers $A^q$ of $A$ and $R_A^q$ of $R_A$ the following holds: $r_{ij}^{(q)} = 1$ if and only if $a_{ij}^{(q)} > 0$, where $A^q = \|a_{ij}^{(q)}\|$, $R_A^q = \|r_{ij}^{(q)}\|$ for $i, j = 1, \ldots, n$. Here, $R_A^q$ is the $q$'th power of the relation matrix $R_A$ obtained by conventional matrix multiplication subject to the usual Boolean rules for addition and multiplication of matrix elements: sum $x + y = \max(x, y)$ and product $xy = \min(x, y)$, where $x, y$ assume only values 0, 1 and the ordering $0 < 1$ obtains [1].
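The correspondence between powers of $A$ and Boolean powers of $R_A$ can be illustrated with a few lines of Python. This is a minimal sketch using plain lists; the function names and the example matrix are assumptions made for illustration, and vertices are numbered from 0 rather than from 1.

```python
def relation_matrix(a):
    """Boolean relation matrix R_A: r_ij = 1 if a_ij > 0, else 0."""
    return [[1 if v > 0 else 0 for v in row] for row in a]

def boolean_product(r, s):
    """Matrix product with sum = max and product = min on {0, 1}."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def ordinary_product(a, b):
    """Conventional matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def edges(r):
    """Edge list of the graph G(R): (i, j) is an edge iff r_ij = 1."""
    n = len(r)
    return [(i, j) for i in range(n) for j in range(n) if r[i][j]]

# Hypothetical nonnegative matrix used only for illustration.
A = [[0, 2, 0],
     [0, 0, 3],
     [1, 0, 0]]
R = relation_matrix(A)

# Check r_ij^(q) = 1 iff a_ij^(q) > 0 for the first few powers q.
A_q, R_q = A, R
agree = True
for q in range(1, 7):
    agree = agree and relation_matrix(A_q) == R_q
    A_q, R_q = ordinary_product(A_q, A), boolean_product(R_q, R)
print(edges(R), agree)
```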

If $A$ is a nonnegative square matrix of order $n$ with Boolean matrix representation $R_A$, then the graph $G(R_A)$ will be called the graph of $A$. The sequence of powers $\{A^k;\ k = 1, 2, \ldots\}$ of $A$ clearly entails the existence of a sequence of graphs $\{G(R_A^k)\}$ which may, however, be shown to be finite in number as distinct graphs [17].

We consider next a series of definitions relating to certain distinguished classes of graphs which depend upon the notion of connectedness. A vertex $\alpha$ of graph $G$ is said to be connected to a vertex $\beta$ in a subgraph $H \subset G$ if $H$ contains edges

$$\gamma_0\gamma_1,\ \gamma_1\gamma_2,\ \ldots,\ \gamma_{m-1}\gamma_m,$$

where $\gamma_0 = \alpha$ and $\gamma_m = \beta$. It is convenient, in this context, to say that $\beta$ in $G$ is attainable from $\alpha$ in $m$ steps by means of a directed path.

We now introduce the concept of a "cyclic net". A subgraph $H \subset G$ is said to be a cyclic net of order $m$ if and only if $H$ contains $m$ ($m > 0$) vertices of $G$ and each vertex of $H$ is connected to every vertex of $H$. A cyclic net $H$ of order $m$ in graph $G$ is said to be simple or Peircean if and only if no proper subgraph $K \subset H$ is a cyclic net. A cyclic net $H$ of order $m$ in graph $G$ is said to be maximal in $G$ if and only if every cyclic net in $G$ is a subgraph of $H$ or contains no vertex in common with $H$. A cyclic net $H$ of order $m$ in graph $G$ is said to be universal if for some positive integer $q$ every vertex of $H$ is attainable in $q$ steps from some vertex $\alpha$ in $H$. A cyclic net $H$ of order $m$ in graph $G$ is said to be closed in $G$ if and only if $H$ is a maximal cyclic net in $G$ and every vertex of $G$ attainable from any vertex in $H$ is contained in $H$. The varieties of cyclic net may manifestly be depicted by diagrams.

It may be shown that a cyclic net of order $m \ge 2$ is universal if and only if the greatest common divisor of the orders of all simple cyclic nets contained therein is unity [17]. Clearly, a cyclic net is at once simple and universal if and only if it is of order one.
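These definitions can be tested mechanically on a Boolean relation matrix. The Python sketch below checks whether a whole graph is a cyclic net (every vertex connected to every vertex, itself included) and whether it is universal. As an assumption of the sketch, universality is tested through the criterion that some Boolean power of the relation matrix is the universal matrix, searching exponents up to Wielandt's bound $(n-1)^2 + 1$; the function names and example graphs are illustrative only.

```python
def boolean_product(r, s):
    """Matrix product with sum = max and product = min on {0, 1}."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_cyclic_net(r):
    """True if every vertex is attainable from every vertex in >= 1 steps."""
    n = len(r)
    reach = [row[:] for row in r]
    power = [row[:] for row in r]
    for _ in range(n - 1):                      # paths of length 2, ..., n
        power = boolean_product(power, r)
        reach = [[max(a, b) for a, b in zip(x, y)] for x, y in zip(reach, power)]
    return all(all(row) for row in reach)

def is_universal(r):
    """True if some Boolean power of r has all entries equal to 1."""
    n = len(r)
    power = [row[:] for row in r]
    for _ in range((n - 1) ** 2 + 1):           # Wielandt's bound on the exponent
        if all(all(row) for row in power):
            return True
        power = boolean_product(power, r)
    return False

# Hypothetical examples: a directed 3-cycle, and the same cycle with a loop.
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
cycle_with_loop = [[1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]]
print(is_cyclic_net(cycle), is_universal(cycle), is_universal(cycle_with_loop))
```

The plain 3-cycle contains a single simple cyclic net of order 3, so the greatest common divisor is 3 and the net is not universal; adding a loop introduces a simple cyclic net of order 1 and makes it universal.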

The graph theoretic concept of cyclic net may be shown to correspond biuniquely to the concept of indecomposability or irreducibility. A nonnegative square matrix $A$ of order $n$ is said to be indecomposable or irreducible if for no permutation matrix $F$ (with transpose $F^T$) does

$$FAF^T = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix},$$

where $A_{11}$, $A_{22}$ are square matrices [2, 7, 27]. The following theorem of correspondence may be established [17]: If $A$ is a finite and nonnegative square matrix, then $A$ is indecomposable (or irreducible) if and only if the graph $G(R_A)$ is a cyclic net. If the preceding Boolean representation of square matrices is generalized so that $r_{ij} = 0$ in $R$ if and only if $a_{ij} = 0$ in an arbitrary matrix $A$, it is clear that the theorem of correspondence holds generally [9, 17].

3. Concepts of Balance and Closure.

We consider next some formal properties of finite nonnegative square matrices 
which find significant applications in statistical economics, in the domain of general- 
ized double-entry accounting, and in the theory of stochastic processes. 

We first state two definitions relating to nonnegative square matrices of order $n$. A finite and nonnegative square matrix $A = \|a_{ij}\|$ will be said to be (row) substochastic if no row sum of $A$ exceeds unity, i.e., $a_{ij} \ge 0$, $r_i = \sum_{j=1}^{n} a_{ij} \le 1$ for all $i$; if each row sum of $A$ is exactly unity the matrix is said to be stochastic. Next, a finite and nonnegative square matrix $X = \|x_{ij}\|$ will be said to be a balanced margin matrix (or to exhibit the balanced margin property) if each indexed row sum of $X$ is exactly equal to the correspondingly indexed column sum of $X$, i.e., $r_i = c_i$ for $i = 1, \ldots, n$, where

$$r_i = \sum_{j=1}^{n} x_{ij} \quad \text{and} \quad c_i = \sum_{j=1}^{n} x_{ji}.$$

To exclude the trivial case, we exclusively consider nonnegative matrices distinct from the null matrix. For the sake of historical definiteness, we will also call the balanced margin property for nonnegative (or, more generally, real) square arrays the Pacioli-Stevinus equalities [23].

The preceding definitions can be given a concise formulation. Let $g$ denote the column vector of dimension $n$ with all elements unity. The nonnegative matrix $A$ is stochastic if $Ag = g$ and is substochastic if $Ag \le g$. The nonnegative matrix $X$ exhibits the balanced margin property if $Xg = X^Tg$.

Some of the formal relations which subsist between stochastic matrices and balanced margin matrices are of general interest and can be simply stated. To do this, we require two definitions. First, a linear system of the form $x(I - A) = w$, $I$ the identity matrix, will be called a finite substochastic system if $A$ is a substochastic matrix and $w$ is a nonnegative (row) vector; if $A$ is a stochastic matrix, the linear system is called stochastic. Second, a solution $x$ of a substochastic system $x(I - A) = w$ will be called admissible if $x$ is finite and nonnegative but not null. It is obvious that in a stochastic system $x(I - A) = w$, admissible solutions exist if and only if $w = 0$, the null vector of appropriate dimension.

The following proposition which relates stochastic matrices to balanced margin matrices is of interest. Let $D(u)$ denote a diagonal matrix containing the ordered components of a row (or column) vector $u$ on the diagonal.

PROPOSITION 1: Let $A$ be a stochastic matrix. Let $x$ be a nonnegative row vector. Then $D(x)A$ is a balanced margin matrix if and only if $x$ is an admissible solution of the stochastic system $x(I - A) = 0$.

PROOF: Let $e$ denote the row vector with all elements unity, viz., $e = g^T$. Directly, $eD(x)A = eA^TD(x)$ if and only if $xA = x$.

The preceding proposition simply states, in effect, that row normalization of a balanced margin matrix $X$ with positive (row and column) margins produces a stochastic matrix $A$ with the margin of $X$ (written as a row vector) an admissible solution of the stochastic system $x(I - A) = 0$; moreover, given an admissible solution $x$ of the system $x(I - A) = 0$, then $D(x)A$ exhibits the balanced margin property.
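Proposition 1 is easy to verify on a small example. The Python sketch below forms $D(x)A$ for a stochastic matrix $A$ and compares row and column margins, once for a vector satisfying $xA = x$ and once for a vector that does not. The matrix, the vectors, and the helper names are hypothetical choices made for illustration; exact rationals avoid rounding questions.

```python
from fractions import Fraction as F

def margins(m):
    """Return (row sums, column sums) of a square matrix."""
    n = len(m)
    rows = [sum(m[i]) for i in range(n)]
    cols = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return rows, cols

def scale_rows(x, a):
    """Return D(x) A: row i of A multiplied by x_i."""
    return [[x[i] * v for v in a[i]] for i in range(len(a))]

# Hypothetical stochastic matrix (each row sums to 1).
A = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 2), F(1, 2)]]

x = [F(1), F(2), F(1)]          # satisfies x A = x
u = [F(1), F(1), F(1)]          # does not satisfy u A = u

rows_x, cols_x = margins(scale_rows(x, A))
rows_u, cols_u = margins(scale_rows(u, A))
print(rows_x == cols_x, rows_u == cols_u)
```

The row margins of $D(x)A$ are the components of $x$ (because $Ag = g$) and the column margins are the components of $xA$, which is the content of the one-line proof.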

We observe, but do not prove here, that if it is possible to obtain a stochastic matrix $A$ by normalization of a given balanced margin matrix $X$, then all admissible solutions of the stochastic system $x(I - A) = 0$ can be stated in terms of the (necessarily) positive margin vector $eX$ and the graph $G(R_X)$ of $X$ (cf. [19] and Theorem 3 of [18]). In fact, it may be shown that in the graph of an arbitrary (nontrivial) balanced margin matrix, to each index of a nonnull row there corresponds a vertex located in a closed cyclic net of the graph ([19] and Theorem 3 of [18]). Consequently, the graph of a (nontrivial) balanced margin matrix is composed of one or more disjoint closed cyclic nets and possibly contains isolated vertices corresponding to indices of null rows and columns.

We consider next a proposition relating to balanced margin matrices which finds application in certain general representations of "dynamic economic equilibrium" [6]. The proposition further applies to certain large-scale interindustrial ("input-output") models, multisector trade or exchange models, and formulations of macroeconomic stability (cf. [2] for bibliography). The fundamental abstract conception underlying all these models may be shown to go back directly to the stationary process representation of the Tableau Economique formulated by the biologist-philosopher François Quesnay (Tableau Economique, published in several versions in 1758 and 1759) [15, 21, 22]. The several studies of Quesnay constitute logical precursors of the investigations of A. J. Lotka and V. Volterra in mathematical biology [12, 25]; the several studies, moreover, exhibit the strands of ancient philosophic doctrines. The original tableau economique representation employs a type of 'circular flow' or recurrent event formulation of generalized 'accounts' which are effectively stated in double-entry form [15].

The proposition of interest rests on the Perron-Frobenius theory of nonnegative matrices [7, 27]. This theory contains the following result: Any indecomposable (or irreducible) nonnegative square matrix $A$ exhibits unique positive (normalized) left and right eigenvectors associated with a simple eigenvalue $\lambda > 0$ such that for any eigenvalue $\alpha$ of $A$, $|\alpha| \le \lambda$. The eigenvalue $\lambda$ of maximum modulus is called the spectral norm of $A$; in the following, all eigenvectors are conventionally taken to be nonnegative.

PROPOSITION 2: Let $A$ be an indecomposable nonnegative square matrix with spectral norm $\rho$. Let $x$, $y$ respectively be left and right normalized eigenvectors of $A$ associated with $\rho$. Then $D(x)AD(y)$ is a balanced margin matrix.

PROOF: Directly, $e[D(x)AD(y)] = xAD(y) = \rho xD(y) = \rho y^TD(x) = y^TA^TD(x) = e[D(y)A^TD(x)]$.

From the standpoint of generality, it is clear that the generalized balanced margin property is intrinsic to 'eigenproblems': Let $C$ be an $n \times n$ complex matrix with $u$, $v$ left and right eigenvectors of $C$ associated with eigenvalue $\lambda$. Then $S = D(u)CD(v)$ is a generalized balanced margin matrix, i.e., $Sg = S^Tg$.
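Proposition 2 can be checked numerically. The Python sketch below approximates the left and right eigenvectors of a strictly positive matrix by power iteration and then compares the row and column margins of $D(x)AD(y)$. Power iteration, the strictly positive example matrix, and the function names are assumptions of this sketch (power iteration requires a primitive matrix, which is stronger than indecomposability).

```python
def perron_pair(a, iters=500):
    """Power iteration: spectral norm and right eigenvector (sum-normalised).

    Assumes the matrix is primitive (for example, strictly positive).
    """
    n = len(a)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = sum(w)                 # since sum(v) == 1, this tends to rho
        v = [t / rho for t in w]
    return rho, v

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical positive matrix used only for illustration.
A = [[1.0, 2.0, 0.5],
     [3.0, 1.0, 1.0],
     [0.5, 2.0, 2.0]]
n = len(A)

rho, y = perron_pair(A)                 # A y = rho y
rho_left, x = perron_pair(transpose(A)) # x A = rho x

X = [[x[i] * A[i][j] * y[j] for j in range(n)] for i in range(n)]
rows = [sum(X[i]) for i in range(n)]
cols = [sum(X[i][j] for i in range(n)) for j in range(n)]
print(max(abs(r - c) for r, c in zip(rows, cols)))
```

Both margins equal $\rho\,x_i y_i$, which is the chain of equalities in the proof read componentwise.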

The proposition we consider next contains some immediate consequences of the two earlier propositions (cf. [7, 27]).

PROPOSITION 3: Let $A$ be an indecomposable nonnegative square matrix with spectral norm $\rho$. Let $x$, $y$ respectively be left and right normalized eigenvectors of $A$ associated with $\rho$. Let $s$ denote the scalar $(x, y)$ and let $E$ denote the diagonal matrix $D(x)D^{-1}(y)$. The following then hold:

(i) $s^{-1}D(y)\,[\rho^{-1}EAE^{-1}]\,D(x)$ is a balanced margin matrix with margin given by the stochastic row vector $s^{-1}xD(y)$;

(ii) $D^{-1}(x)\,[\rho^{-1}EAE^{-1}]\,D(x)$ is a (row) stochastic matrix with normalized left invariant vector given by $s^{-1}xD(y)$;

(iii) $D(y)\,[\rho^{-1}EAE^{-1}]\,D^{-1}(y)$ is a (column) stochastic matrix with normalized right invariant vector given by $s^{-1}D(x)y$;

(iv) $x\,[\rho^{-1}AE^{-1}] = y^T$ and $[\rho^{-1}AE^{-1}]\,x^T = y$;

(v) $[\rho^{-1}EA]\,y = x^T$ and $y^T\,[\rho^{-1}EA] = x$.

From the preceding proposition, it is clear that the 'extremal' $y$ constitutes a right unit for the row stochastic matrix $D^{-1}(y)\,[\rho^{-1}A]\,D(y)$ of (ii); analogously, the 'extremal' $x$ constitutes a left unit for the column stochastic matrix $D(x)\,[\rho^{-1}A]\,D^{-1}(x)$ of (iii).
The results of Proposition 3 find application in a formulation of the following character which occurs frequently in certain resource allocation problems (cf. [2] for references, and [6]). Consider an arbitrary but fixed nonnegative matrix $A$ which may be taken to be a matrix of 'resource flows'. Consider next arbitrary positive vectors $\beta$ and $\omega$, such that $\beta$ is an ($n \times 1$) column vector and $\omega$ is a ($1 \times n$) row vector. Let $\tilde{A} = A_{\beta,\omega}$ denote the ($n+1 \times n+1$) matrix

$$\tilde{A} = \begin{bmatrix} A & \beta \\ \omega & 0 \end{bmatrix}.$$

The matrix $\tilde{A}$ is clearly indecomposable or irreducible, for in the graph of $\tilde{A}$ each vertex $\alpha_i$ ($i = 1, \ldots, n$) is connected in one step to the vertex $\alpha_{n+1}$ which in turn is connected in one step to every vertex $\alpha_i$; there then exist at least $n$ simple cyclic nets of order 2 in the graph of $\tilde{A}$. Moreover, if $A$ is distinct from the null matrix, then the matrix $\tilde{A}$ is necessarily primitive (i.e., exhibits a single root of maximum modulus [7, 27]) so that all powers $\tilde{A}^k$ are surely positive for all $k \ge n^2 + 1$ [17, 27]. The matrix $\tilde{A}$ is primitive if and only if the graph $G(R_{\tilde{A}})$ is a universal cyclic net [17]. But $G(R_{\tilde{A}})$ is a universal cyclic net if $a_{ii} > 0$ or if $a_{ij} > 0$ ($i \ne j$); for the graph contains at least $n$ simple cyclic nets of order 2 and $a_{ii} > 0$ or $a_{ij} > 0$ ($i \ne j$) respectively entail the existence of a simple cyclic net of order 1 or of order 3 in the graph (cf. Section 2).
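The primitivity claim for the bordered matrix can be checked on a small case. The Python sketch below builds the bordered matrix from a matrix $A$ and positive vectors, takes its Boolean relation matrix, and tests whether the Boolean power of exponent $n^2 + 1$ is the universal matrix. The zero in the lower right corner, the example data, and all function names are assumptions of this sketch.

```python
def bordered(a, beta, omega):
    """(n+1) x (n+1) matrix [[A, beta], [omega, 0]]."""
    return [list(row) + [b] for row, b in zip(a, beta)] + [list(omega) + [0]]

def relation_matrix(a):
    """Boolean relation matrix: 1 where the entry is positive, else 0."""
    return [[1 if v > 0 else 0 for v in row] for row in a]

def boolean_product(r, s):
    """Matrix product with sum = max and product = min on {0, 1}."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def boolean_power(r, k):
    """k-th Boolean power of r for k >= 1."""
    result = r
    for _ in range(k - 1):
        result = boolean_product(result, r)
    return result

def all_positive_at_bound(a, beta, omega):
    """True if the Boolean power of exponent n^2 + 1 has no zero entry."""
    n = len(a)
    r = relation_matrix(bordered(a, beta, omega))
    return all(all(row) for row in boolean_power(r, n * n + 1))

# Hypothetical data: one non-null A and the null matrix of the same order.
beta, omega = [1, 2], [3, 1]
non_null = all_positive_at_bound([[0, 1], [0, 0]], beta, omega)
null_case = all_positive_at_bound([[0, 0], [0, 0]], beta, omega)
print(non_null, null_case)
```

With the null matrix the graph contains only simple cyclic nets of order 2, so no power is positive; a single positive off-diagonal entry of $A$ adds a simple cyclic net of order 3 and the power of exponent $n^2 + 1$ becomes positive.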

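From the preceding propositions, it is clear that the indecomposable and nonnegative matrix $A_{\beta,\omega}$ can always be simply transformed so as to exhibit the balanced margin property. Let $\tilde{x}$, $\tilde{y}$ respectively denote the normalized left and right eigenvectors of $A_{\beta,\omega}$ associated with the spectral norm $\lambda$ of $A_{\beta,\omega}$. Clearly, $D(\tilde{x})\tilde{A}D(\tilde{y})$ is a balanced margin matrix, where $\tilde{A} = A_{\beta,\omega}$. Let $\tilde{x}$ be written as $[x, x_{n+1}]$ and $\tilde{y}$ as

$$\begin{bmatrix} y \\ y_{n+1} \end{bmatrix},$$

where $x$ is a ($1 \times n$) vector and $y$ is an ($n \times 1$) vector. Consider the matrix equations

(1)    $[x, x_{n+1}]\,\tilde{A} = \lambda\,[x, x_{n+1}]$

and

(2)    $\tilde{A}\begin{bmatrix} y \\ y_{n+1} \end{bmatrix} = \lambda\begin{bmatrix} y \\ y_{n+1} \end{bmatrix}$.

By simple transformations and using the fact that $(\lambda I - A)^{-1}$ exists since $\tilde{A}$ is indecomposable, we then obtain

(3a)    $x = x_{n+1}\,\omega(\lambda I - A)^{-1}$,

(3b)    $y = (\lambda I - A)^{-1}\,y_{n+1}\,\beta$,

(3c)    $(\omega, y)/y_{n+1} = (x, \beta)/x_{n+1} = \lambda$;

equivalently, $\omega(\lambda I - A)^{-1}\beta = \lambda$.

By Proposition 3, the matrices $\mathcal{P} = D^{-1}(\tilde{y})\,[\lambda^{-1}\tilde{A}]\,D(\tilde{y})$ and $\mathcal{L} = D(\tilde{x})\,[\lambda^{-1}\tilde{A}]\,D^{-1}(\tilde{x})$ are respectively row stochastic and column stochastic. $\mathcal{P}$ has the left stationary stochastic vector $s^{-1}\tilde{x}D(\tilde{y})$ and $\mathcal{L}$ has the right stationary stochastic vector $s^{-1}D(\tilde{x})\tilde{y}$, where the scalar $s = (\tilde{x}, \tilde{y})$. Thus, one may write $\mathcal{P}$ and $\mathcal{L}$ in the following manner:

(4a)    $\mathcal{P} = \begin{bmatrix} D^{-1}(y)\,[\lambda^{-1}A]\,D(y) & \lambda^{-1}y_{n+1}\,D^{-1}(y)\beta \\ \omega D(y)/\lambda y_{n+1} & 0 \end{bmatrix}$,

(4b)    $\mathcal{L} = \begin{bmatrix} D(x)\,[\lambda^{-1}A]\,D^{-1}(x) & D(x)\beta/\lambda x_{n+1} \\ \lambda^{-1}x_{n+1}\,\omega D^{-1}(x) & 0 \end{bmatrix}$.

Moreover, it is clear that one may express the relevant stationary (stochastic) vectors for $\mathcal{P}$ and $\mathcal{L}$ respectively as follows:

(5)    $s^{-1}\tilde{x}D(\tilde{y}) = p_{n+1}\left[\,\omega^*\{I - D^{-1}(y)\,[\lambda^{-1}A]\,D(y)\}^{-1},\ 1\,\right]$,

where $p_{n+1} = s^{-1}x_{n+1}y_{n+1}$ and $\omega^*$ is the stochastic row vector $\omega^* = \omega D(y)/\lambda y_{n+1}$;

(6)    $s^{-1}D(\tilde{x})\tilde{y} = p_{n+1}\begin{bmatrix} \{I - D(x)\,[\lambda^{-1}A]\,D^{-1}(x)\}^{-1}\beta^* \\ 1 \end{bmatrix}$,

where $p_{n+1}$ is as in (5) and $\beta^*$ is the stochastic column vector $\beta^* = D(x)\beta/\lambda x_{n+1}$.

The matrix $D^{-1}(y)\,[\lambda^{-1}A]\,D(y)$ is row substochastic and contains no (row) stochastic principal submatrix; analogously, $D(x)\,[\lambda^{-1}A]\,D^{-1}(x)$ is column substochastic and contains no (column) stochastic principal submatrix (cf. Theorem 1* of [18]). These properties follow directly from the essential indecomposability of $A_{\beta,\omega}$.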

It seems clear, from the development so far, that models of complex behavioral 
systems (e.g., models of 'resource flows') involving interconnected or functionally 
interdependent parts can under certain circumstances be formally depicted as Markov 
processes, more specifically, as finite indecomposable Markov chains with discrete 
parameter [3, 4, 5]. This is the case for certain representations which introduce (finite) indecomposable nonnegative matrices and in which the concepts of 'balance' (e.g., resource balance) or of 'stationary distribution' or of 'stationary flows' play an intrinsic role [18]. In such cases, the transformed nonnegative matrix regarded as the transition matrix of a discrete parameter (time homogeneous) Markov chain exhibits a graph ('transition diagram') in which the vertices correspond to states and the edges to one-step transitions between states. In many of these formulations, the preceding concepts are treated as equivalent to or are associated with some notion of
'equilibrium'. More generally, indecomposable structure is clearly not a requirement 
for a Markov chain representation of such models [ 18 ] . 

The notion of a balanced margin matrix is a relative concept. A proper principal submatrix $A$ of a nonnegative matrix $C$ may in general exhibit the balanced margin property when $C$ does not; the converse also holds generally. Consider an indecomposable nonnegative square matrix

$$C = \begin{bmatrix} A & E_{12} \\ E_{21} & B \end{bmatrix},$$

where $A$, $B$ are both square and indecomposable. Let $z = [z_1, z_2]$ and $w = [w_1, w_2]$ respectively denote left and right eigenvectors associated with the spectral norm $\mu$ of $C$; for convenience, let the individual component vectors of $z$ and $w$ be written as stochastic vectors. Let $x_A$, $y_A$ ($x_B$, $y_B$) respectively denote normalized left and right eigenvectors associated with the spectral norm $\lambda_A$ ($\lambda_B$) of $A$ ($B$). In order that the principal submatrices $A$, $B$ exhibit the balanced margin property respectively for $(x_A, y_A; \lambda_A)$ and for $(x_B, y_B; \lambda_B)$ when $C$ exhibits the balanced margin property for $(z, w; \mu)$ it is necessary and sufficient that the following hold:

where the inequalities $\mu > \lambda_A$, $\mu > \lambda_B$ obtain and are consequences of the indecomposability of $C$ [27].

We consider next the closure representation or completion of any finite substochastic system $x(I - A) = w$, $(I - A)$ singular or not. This representation enables a solution algorithm to be formally depicted as a finite-state time homogeneous Markov chain (discrete parameter); in effect, a 'computation' is replaced by an equivalent 'process'. In simple cases, the final statistical equilibrium vector coincides, except for a scale factor, with the solution(s) of the system $x(I - A) = w$. Closure bears a direct relation to balanced margin considerations.

The completion or closure representation of a substochastic system $x(I - A) = w$ of order $n$ entails the embedding of the system in a well-defined stochastic system of order $(n+1)$. It then becomes possible to characterize the solution structure of the original system in terms of the solution structure of the containing stochastic system.

Consider a substochastic system $x(I - A) = w$ of order $n$, $(I - A)$ singular or not. The matrix $(I - A)$ is singular if and only if $A$ contains a stochastic principal submatrix [18]; if $(I - A)$ is nonsingular, then the inverse may be stated in the form of the Neumann series, $(I - A)^{-1} = \sum_{h=0}^{\infty} A^h$. Let $\sigma_w$ denote the sum of the components of $w$, i.e., $\sigma_w = wg$, $g$ the column vector with all elements unity. Let the row vector $w^*$ be defined as follows: $w^* = \sigma_w^{-1}w$ if $w \ne 0_n$, $w^* = 0_n$ otherwise, $0_n$ the null vector of dimension $n$.

Let $\tilde{A}_w$ denote the square stochastic matrix of order $(n+1)$,

$$\tilde{A}_w = \begin{bmatrix} A & (I - A)g \\ w^* & 1 - w^*g \end{bmatrix}.$$

The stochastic system of order $(n+1)$, $\tilde{z}(I - \tilde{A}_w) = 0_{n+1}$, will be called the closure (representation) or completion of the substochastic system $x(I - A) = w$ for given $w$, where $\tilde{z} = (z, z_{n+1})$, $z$ a row vector of dimension $n$.

The following proposition characterizes the admissible solutions of a finite substochastic system in terms of the solutions of its closure.

THEOREM: Let $x(I - A) = w$ be a substochastic system of order $n$ with closure $\tilde{z}(I - \tilde{A}_w) = 0_{n+1}$. A vector $x$ is an admissible solution of the system $x(I - A) = w$ if and only if $(x, \sigma_w)$ is an admissible solution of the closure of the system. The system $x(I - A) = w$ exhibits no admissible solution if and only if every admissible (left) stochastic solution of the closure contains a last component equal to $1 - w^*g$.

PROOF: From the definition of closure, it follows directly that any substochastic system which exhibits at least one admissible solution is equivalent to the constrained stochastic system $(x, \sigma_w)(I - \tilde{A}_w) = 0_{n+1}$. It remains to consider the last statement of the Theorem. The 'if' part follows from the first statement since $1 - w^*g$ can assume only the values 1, 0. If $\sigma_w = 0$, $1 - w^*g = 1$; and if $\sigma_w \ne 0$, $1 - w^*g = 0$; both cases are inconsistent with the existence of admissible solutions for the system $x(I - A) = w$. We consider the 'only if' part of the last statement. There are two cases and these depend on the regularity of $(I - A)$. If $(I - A)^{-1}$ exists, admissible solutions fail to exist only if $w = 0_n$. The index $(n+1)$ of $\tilde{A}_w$ is then an 'absorbing' index or state and, consequently, the unique stationary stochastic vector of $\tilde{A}_w$ is $(0_n, 1)$. If $(I - A)$ is singular, admissible solutions fail to exist only if the $(n+1)$'st row of $\tilde{A}_w$ contains positive entries in columns associated with one or more stochastic principal submatrices of $\tilde{A}_w$ (cf. Theorem 4 of [18]). The index or state $(n+1)$ of $\tilde{A}_w$ is then 'transient' and thus all stationary stochastic vectors of $\tilde{A}_w$ are null in the last component. This completes the proof.

The completion or closure $z(I - \tilde{A}_w) = 0_{n+1}$ always has admissible solutions. This
follows directly from the so-called mean ergodic theorem for finite stationary Markov
chains (discrete parameter) [3, 4, 5]. The stochastic matrix

$\tilde{A}_w^{*} = \lim_{s \to \infty} s^{-1} \sum_{h=1}^{s} \tilde{A}_w^{h}$

always exists and it is known that every admissible stochastic solution of
$z(I - \tilde{A}_w) = 0_{n+1}$ is given as a convex linear combination of the rows of $\tilde{A}_w^{*}$ [3]. In the
preceding result, relative to the non-existence of admissible solutions, the 'testing
scalar' $1 - w^* g$ can assume only two values: 1 if $w = 0_n$ and 0 if $w \neq 0_n$. Admissible
solutions fail to exist in the first case only when $(I - A)$ is nonsingular so that all the
"probability mass" accumulates in the (n+1)'st state which is a unique absorbing state in
the Markov chain with fixed transition matrix $\tilde{A}_w$. The second case of non-existence of
admissible solutions occurs only when $(I - A)$ is singular and "probability mass" from
the vector $w \neq 0_n$ enters and irreversibly accumulates in the closed cyclic nets
associated with one or more stochastic principal submatrices of A in the Markov chain with
fixed transition matrix $\tilde{A}_w$ [17, 18].

For finite substochastic systems $x(I - A) = w$ which frequently occur in some
applications (e.g., in statistical economics or in generalized resource accounting [2, 10,
24]), viz., $(I - A)$ nonsingular, A nonnull, and w positive, the preceding Theorem
leads to a simple iterative algorithm with possibly advantageous round-off and
error-stability properties. The (infinite) algorithm may be made applicable under more
general conditions but, depending on the structure of the graph of the matrix A, may have
dubious efficiency, e.g., in case the matrix A is decomposable. Under the stated
conditions, the stochastic matrix $\tilde{A}_w$ is the transition matrix of a regular Markov chain
(with graph a universal cyclic net, cf. p. 65) so that the powers of $\tilde{A}_w$ converge
exponentially fast to a limit matrix equal to $\tilde{A}_w^{*}$. The recursion $z^{(k+1)}(w) = z^{(k)}(w) \tilde{A}_w$
(k = 0, 1, 2, ...), $z^{(0)}$ an arbitrary initial stochastic vector, accordingly converges
exponentially fast to the limit "statistical equilibrium" distribution vector $p_w$ of the
regular Markov chain, where $p_w = (p_{n+1} w^* (I - A)^{-1}, p_{n+1})$. Clearly, $(\sigma_w / p_{n+1}) p_w
= (x, \sigma_w)$, where $x = w(I - A)^{-1}$. In short, the constrained solution of the closure is
given by a scalar times the "statistical equilibrium" vector $p_w$, where the scalar is
simply the product of the magnitude or measure $\sigma_w$ in w by the mean recurrence time
$1/p_{n+1}$ of the closure index or state (n+1).
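
The iterative algorithm just described can be sketched in a few lines of Python. This is
only an illustration under the stated conditions ((I - A) nonsingular, A nonnull, w
positive); the function names `closure` and `solve_by_closure`, the uniform starting
vector, and the fixed iteration count are assumptions made for the example and do not
come from the text.

```python
def closure(A, w):
    """Return the (n+1) x (n+1) closure matrix and sigma_w, the sum of w."""
    n = len(A)
    sigma = sum(w)
    w_star = [wi / sigma for wi in w] if sigma != 0 else [0.0] * n
    # Rows 1..n: A bordered by the column (I - A)g, i.e. 1 minus each row sum of A.
    rows = [list(A[i]) + [1.0 - sum(A[i])] for i in range(n)]
    # Row n+1: w* bordered by the scalar 1 - w*g.
    rows.append(w_star + [1.0 - sum(w_star)])
    return rows, sigma


def solve_by_closure(A, w, iterations=500):
    """Approximate x = w (I - A)^{-1} by iterating z <- z * (closure matrix)."""
    Aw, sigma = closure(A, w)
    m = len(Aw)
    z = [1.0 / m] * m  # arbitrary initial stochastic vector (assumed uniform here)
    for _ in range(iterations):
        z = [sum(z[i] * Aw[i][j] for i in range(m)) for j in range(m)]
    # Rescale the equilibrium vector by sigma_w / p_{n+1}; drop the last component.
    scale = sigma / z[-1]
    return [scale * zi for zi in z[:-1]]


A = [[0.2, 0.3], [0.4, 0.1]]   # hypothetical substochastic matrix
w = [1.0, 2.0]                 # hypothetical positive right-hand side
x = solve_by_closure(A, w)
print(x)
```

For this small A and w the direct solution of x(I - A) = w is x = (17/6, 19/6), which the
iteration approaches without any matrix inversion.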

From these considerations, it therefore follows that if the substochastic system
$x(I - A) = w$ (under the conditions, $(I - A)$ regular, A nonnull, w positive) is of large
scale, then it may be solved without requiring the inversion of matrices of large order
and without effecting scale-reduction or consolidation of the system $x(I - A) = w$ [19].
Consolidation of the system may, however, be of interest for essentially empirical
reasons (as in some 'resource flow' models) rather than for formal or computational
considerations. The Markov chain analogy (which is a successive approximations
method) makes it possible to effect modifications in the matrix A (as well as in w) and
also to utilize the (normalized) admissible solutions of earlier problems as initial
vectors for the recursion in new problems. The matrix $\tilde{A}_w$ of the closure exhibits the
balanced margin property in the form $D(x, \sigma_w) \tilde{A}_w$, and may be readily seen to be of the
same form as the row stochastic matrix P considered earlier. The behavior of the
powers of $\tilde{A}_w$ may be made evident by writing $\tilde{A}_w$ in the form:

$\tilde{A}_w = \begin{Vmatrix} A & 0_n^T \\ w^* & 1 \end{Vmatrix} + \begin{Vmatrix} 0 & (I - A)g \\ 0_n & -1 \end{Vmatrix}$

From the Theorem, it is clear that certain 'balance' or 'equilibrium' problems can
be stated in the formalism of finite-state time-homogeneous Markov chains; indeed,
many equilibrium conditions for models in statistical and mathematical economics
follow as immediate consequences of the basic properties of balanced margin matrices.
Conceptually, the stochastic process representation permits abstract formulations
involving nonnegative square matrices to become as well models of "circulation" and
"distribution" of abstract measure or "mass" [15]. Such models may accordingly be
taken to correspond to hypothetical random walks of elementary "mass" units (in some
appropriate measure) on a directed graph or network in such manner that the potential
statistical "observables", e.g., $(x, \sigma_w)$, are viewed as the expected outcomes of
large-number replications of motion on the graph. The "circular flow" and "recurrent
event" representations of this aspect of the powers of nonnegative square matrices are
of an abstract character and quite independent of immediate economic or accounting
interpretations. The representation, however, has established roots in the classical
political economy (i.e., statistical economics) of the 18th century (in the
Cantillon-Quesnay-Isnard tableau économique [21, 22]).

4. Aspects of Interaction.

We consider next some intrinsic aspects of interaction from the standpoint of the
representation of complex systems. Consider first a system of recursions:

(1) $x_\alpha^{(k)} = x_\alpha^{(k-1)} A_{\alpha\alpha} + \sum_{\beta \neq \alpha} x_\beta^{(k-1)} A_{\beta\alpha}$ ($\alpha = 1, \ldots, r$; $k = 1, 2, \ldots$)

where the matrices $A_{\beta\alpha}$ are nonnegative and of dimension $(n_\beta \times n_\alpha)$ and the initial
vectors $x_\alpha^{(0)}$ ($\alpha = 1, \ldots, r$) are all nonnegative and nonnull; $\sum_{\alpha=1}^{r} n_\alpha = m$. The
matrices $A_{\beta\alpha}$ ($\beta \neq \alpha$) will be called interaction terms $\{\beta, \alpha\}$ for the elements
indexed $\alpha$ in the recursions; the matrices $A_{\alpha\alpha}$ will be called reflexive terms $\{\alpha, \alpha\}$
for the elements indexed $\alpha$.

Given the convention on indices, the system of recursions may be written in the form

(2) $x^{(k)} = x^{(k-1)} B$ ($k = 1, 2, \ldots$),

where $B = \| A_{\gamma\delta} \|$ ($\gamma, \delta = 1, \ldots, r$). Let B be an indecomposable and primitive
matrix [7, 27]. Moreover, let B for convenience be a row stochastic matrix $P = B$.
The system of recursions can then be viewed as sub-computations in a larger
computation or process in which the unique (left) stationary stochastic vector of P is to be
determined as limit vector; the vectors $x_\alpha^{(k)}$ may be regarded as intermediate
results which are transmitted between elements $\alpha$ in the course of an infinite algorithm.

Now let P be arbitrarily partitioned in a manner $\sigma$ so that $P = \| C_{jk}^{(\sigma)} \|$
($j, k = 1, \ldots, p$) where the matrices $C_{jk}^{(\sigma)}$ are of dimension $(n_j \times n_k)$. In analogous
fashion, the matrices $C_{kh}^{(\sigma)}$ ($k \neq h$) are called interaction terms and the matrices
$C_{kk}^{(\sigma)}$ reflexive terms. Let P be written in the form $P = U^{(\sigma)} + R^{(\sigma)}$, where $U^{(\sigma)}$ is
a block-diagonal matrix containing exclusively the reflexive terms $C_{kk}^{(\sigma)}$ and $R^{(\sigma)}$
contains all interaction terms.

Consider the nonnegative matrix $T^{(\sigma)}$ defined as $T^{(\sigma)} = R^{(\sigma)} (I - U^{(\sigma)})^{-1}$ which
exists by virtue of the indecomposability of P, for we exclude the trivial cases of
$U^{(\sigma)} = 0$ (the null matrix) and $U^{(\sigma)} = P$. The following relation then obtains:

(3) $P = U^{(\sigma)} + T^{(\sigma)} - T^{(\sigma)} U^{(\sigma)}$.

It then follows that

(4) $(I - P)(I - U^{(\sigma)})^{-1} = (I - T^{(\sigma)})$

holds. From the last relation (since P is indecomposable), it is evident that $T^{(\sigma)}$ is
indecomposable with spectral norm unity. $T^{(\sigma)}$ and P then exhibit the same left
stationary vectors; $T^{(\sigma)}$ is not necessarily primitive. If $T^{(\sigma)}$ is primitive, any system
$z^{(k)} = z^{(k-1)} T^{(\sigma)}$ (where $z^{(0)}$ is nonnegative but not null) will yield the same
(normalized) limit left stationary vector as the system $x^{(k)} = x^{(k-1)} P$. Depending on the
partition $\sigma$ and the structure of the graph of P, the limit vector may be determined by
finite algorithm with great computational efficiency and without regard to primitivity.
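
As a numerical check on relations (3) and (4), the following Python sketch splits a small
stochastic matrix into reflexive and interaction parts, forms T = R(I - U)^{-1}, and
confirms that the stationary vector of P is also stationary for T. The matrix P, the
two-block partition, and all helper names (`matmul`, `inverse`, `split_by_partition`,
`stationary`) are hypothetical choices for illustration only.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def split_by_partition(P, blocks):
    """U keeps entries whose row and column share a block; R keeps the rest."""
    block_of = {i: b for b, members in enumerate(blocks) for i in members}
    n = len(P)
    U = [[P[i][j] if block_of[i] == block_of[j] else 0.0 for j in range(n)]
         for i in range(n)]
    R = [[P[i][j] - U[i][j] for j in range(n)] for i in range(n)]
    return U, R


def stationary(M, iterations=2000):
    """Left stationary stochastic vector by power iteration (M assumed primitive)."""
    n = len(M)
    z = [1.0 / n] * n
    for _ in range(iterations):
        z = [sum(z[i] * M[i][j] for i in range(n)) for j in range(n)]
    return z


P = [[0.10, 0.20, 0.30, 0.40],
     [0.40, 0.30, 0.20, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.30, 0.30, 0.20, 0.20]]
U, R = split_by_partition(P, [[0, 1], [2, 3]])
I_minus_U = [[(1.0 if i == j else 0.0) - U[i][j] for j in range(4)] for i in range(4)]
T = matmul(R, inverse(I_minus_U))

p = stationary(P)
pT = [sum(p[i] * T[i][j] for i in range(4)) for j in range(4)]
print(max(abs(a - b) for a, b in zip(p, pT)))  # should be near zero
```

The check rests on the factorization (I - P) = (I - T)(I - U): since (I - U) is
invertible, p(I - P) = 0 holds exactly when p(I - T) = 0.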

Consider next a system of Boolean recursions (governed by the rules of Boolean
algebra):

(5) $\xi_\alpha^{(k)} = \xi_\alpha^{(k-1)} R_{\alpha\alpha} + \sum_{\beta \neq \alpha} \xi_\beta^{(k-1)} R_{\beta\alpha}$ ($\alpha = 1, \ldots, r$; $k = 1, 2, \ldots$),

where the relation matrices $R_{\gamma\delta}$ contain only elements 0 or 1 and are of dimension
$(n_\gamma \times n_\delta)$; $\sum_{\alpha} n_\alpha = m$. The Boolean relation matrices $R_{\beta\alpha}$ ($R_{\alpha\alpha}$) may similarly
be called interaction (reflexive) terms for the elements indexed $\alpha$ in the recursions.
The system of recursions may analogously be written in the concise form

(6) $\xi^{(k)} = \xi^{(k-1)} R$ ($k = 1, 2, \ldots$),

where $R = \| R_{\gamma\delta} \|$ ($\gamma, \delta = 1, \ldots, r$). The powers of Boolean relation matrices
have been characterized in [17]. The following result (Lemma 5, ibid.) is of general
applicability in computation and the representation of finite processes [18]: If in the
graph G(R) of a Boolean relation matrix R each vertex is connected to at least one
vertex, then there exists at least one closed cyclic net in the graph and every vertex of
G(R) is connected to one or more closed cyclic nets in G(R). If such a relation R were
in addition many-one, then the graph G(R) would be composed of one or more disjoint
subgraphs each terminating in a simple cyclic net ([17] and *96 in [26]). Each
subgraph containing a simple cyclic net of order one is said to contain a decisive terminus.
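
The many-one case of the lemma can be illustrated with a short Python sketch: when
each vertex has exactly one successor, following successors from any vertex must end in a
simple cyclic net. The dictionary `successor` and the function `terminal_cycle` are
hypothetical names introduced for this example, and the five-vertex relation is invented.

```python
def terminal_cycle(successor, start):
    """Follow a many-one relation from `start` and return the cycle it ends in."""
    position = {}   # vertex -> index at which it first appeared on the path
    path = []
    v = start
    while v not in position:
        position[v] = len(path)
        path.append(v)
        v = successor[v]
    # The first repeated vertex marks where the simple cyclic net begins.
    return path[position[v]:]


# Hypothetical many-one relation on five vertices: each vertex has one successor.
successor = {0: 1, 1: 2, 2: 1, 3: 3, 4: 3}

cycles = {v: terminal_cycle(successor, v) for v in successor}
# A cycle of order one (a single self-related vertex) is a decisive terminus.
decisive = {v for v, c in cycles.items() if len(c) == 1}
print(cycles)
print(decisive)
```

Here vertices 0, 1, 2 terminate in the cyclic net on {1, 2}, while vertices 3 and 4 lie in
a subgraph whose terminus is the single vertex 3.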
5. Balance, Closure, and Interaction in Models of Complex Systems.

In this section, we exemplify the concepts of balance, closure, and interaction in
some simple formulations of complex behavioral systems.

a. Accounting Frameworks.

We consider first what is, in effect, the oldest flow network or matrix framework
for the description and control of subsystems; the framework was set forth in one of
the first mathematical works to be printed (in the West) by Fra Luca Pacioli in the
Summa de arithmetica, geometria, proportioni et proportionalita (Venice, 1494). (In
the narrow sense of statistical economics, the following considerations apply to
corporate accounting, managerial and cost accounting, and accounting in the individual
domains of national income and product arrays, input-output flows, balance of
payments, and money-flows accounting.)

Consider a classification grid or grating imposed on the 'transactions' of an
organized 'entity' in which there exist m abstract collections; each collection is called an
'account'. The square grid with row and column indices j (j = 1, ..., m) is called an
articulation statement. The row and column aspects for 'account' j are respectively
called 'debit j' and 'credit j' (j = 1, ..., m). For some arbitrary 'time period' and in
some code of valuation or measure conventions, let $x_{ij}$ denote the total 'measured
magnitude' of 'transactions' at once debited to 'account i' and credited to 'account j'.
It is conventional but not necessary that $x_{ij}$ be taken to be nonnegative and that
$i \neq j$ (i, j = 1, ..., m) in the preceding statement; in the following, then, $x_{ij}$ may be
real and $i = j$ is admissible.

For the square matrix X of order m, let $\delta_i \equiv \max[r_i, c_i] - r_i$ and
$\gamma_i \equiv \max[r_i, c_i] - c_i$ where $r_i$, $c_i$ are respectively the i'th row sum and i'th
column sum (i = 1, ..., m). The square matrix $\bar{X}$ of order (m+1) is called the effective
closure of the matrix X of order m if $\bar{X}$ satisfies the following bordering conditions
relative to X:

$\bar{X} = \begin{Vmatrix} X & \delta^T \\ \gamma & 0 \end{Vmatrix}$, with $\delta = (\delta_1, \ldots, \delta_m)$ and $\gamma = (\gamma_1, \ldots, \gamma_m)$.

The effective closure $\bar{X}$ satisfies the Pacioli-Stevinus conditions identically and is a
(generalized) balanced margin matrix. The row and column index (m+1) of $\bar{X}$ is
associated with a collection called the 'balance-sheet account' with analogous row-debit
and column-credit names. The (m+1)'st row and column of $\bar{X}$ are in fact equivalent to
the balance-sheet account. Moreover, $\sum_{i=1}^{m} \delta_i = \sum_{j=1}^{m} \gamma_j$ is the fundamental
debit-credit (or conservation) identity of all double-entry accounting.
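
The bordering construction can be made concrete with a small Python sketch. The function
name `effective_closure` and the 3 x 3 'transactions' array are hypothetical; the code
simply appends the column of row deficiencies and the row of column deficiencies defined
above, then checks that every row sum equals the corresponding column sum.

```python
def effective_closure(X):
    """Border X with a deficiency column (delta) and a deficiency row (gamma)."""
    m = len(X)
    row = [sum(X[i]) for i in range(m)]
    col = [sum(X[i][j] for i in range(m)) for j in range(m)]
    delta = [max(row[i], col[i]) - row[i] for i in range(m)]
    gamma = [max(row[i], col[i]) - col[i] for i in range(m)]
    closed = [list(X[i]) + [delta[i]] for i in range(m)]
    closed.append(gamma + [0])  # the (m+1)'st row: the 'balance-sheet account'
    return closed


# Hypothetical 'transactions' array for three accounts.
X = [[0, 5, 1],
     [2, 0, 0],
     [1, 1, 0]]
Xbar = effective_closure(X)

row_sums = [sum(r) for r in Xbar]
col_sums = [sum(Xbar[i][j] for i in range(len(Xbar))) for j in range(len(Xbar))]
print(Xbar)
print(row_sums, col_sums)
```

In this example the deficiency column is (0, 4, 0) and the deficiency row is (3, 0, 1),
so both sum to 4, which is the debit-credit identity for the bordered array.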

It is noteworthy that in closed mathematical or statistical economic models which
are 'arithmetized' in terms of articulated arrays of accounts, the 'equilibrium' and
'balance' concepts may become formally equivalent depending on the completeness of
the 'closure'. In individual studies of resource flow systems, these concepts may have
scientific or control utility where these possibilities may overlap.

b. "Input-Output" Representations. 

We turn now to certain large-scale models of mathematical and statistical economics
which are typically 'arithmetized' in accounting frameworks. These formulations
include certain interindustrial ('input-output') models, multisector trade or exchange
models, circulation models, and models of macroeconomic stability [2, 10, 18, 24].
Since all these formulations in a definite sense constitute simple examples of complex
systems, we will for convenience refer to these as 'input-output' models; 'input' and
'output' may require redefinition from one model context to another.

The framework of certain of these 'input-output' models may be set forth in the
following manner. At a given level of fine detail, a classification grid is imposed on the
'transactions' of an 'economy' in a generalized sense. The representation of the
'economy' is taken to be sufficiently detailed to possibly afford the specification of
interaction terms for multiplicities of 'subeconomies'. There are taken to be m collections
of 'entities' of analytic interest (e.g., establishments, firms, households, activities,
etc.). Each collection is called a sector, industry, activity, or component for
'entities' exhibiting a common property of analytic interest (e.g., producing or consuming
'similar' resources). A set of 'transaction' mass observables is assumed to be given
for some definite historical period or is statistically averaged over some well-defined
combination of time periods. The 'transaction' observables may be stated in terms of
the behavioral valuation conventions of the analytic 'entities' or they may be further
stated in terms of the consistent valuation conventions of the designer of the grid. Let
$x_{ij}$ denote the valuation of purchases or procurements made by sector or activity i
from sector or activity j; let $x_{ij}$ be called the value of input to sector or activity i
procured from sector or activity j. Let $\sum_{h=1}^{m} x_{ih}$ be denoted by $x_i$ and called
the value of output of activity or sector i. Let $a_{ij}$ denote the normalized value of input
$x_{ij}/x_i$, and let it be called the unit input to sector or activity i from sector or activity
j; $a_{ii}$ (i = 1, ..., m) may but need not be null so that valuations may be uniformly
expressed in 'gross' or in 'net' (i.e., $a_{ii} = 0$ for all i) terms. The sectors or activities
may have an implicit or explicit time reference and some activities may but need not
explicitly refer to 'investment'.

The 'transactions' matrix X is constructed to be in effective closure form so that X
is a balanced margin matrix, viz., $Xg = X^T g$. Let $A^*$ be the row-normalized form of
X where some or all of the matrix elements of $A^*$ are viewed as 'flow parameters'
(cf. Proposition 1). $A^*$ is a stochastic matrix and any proper principal submatrix A
of $A^*$ is obviously substochastic; some proper principal submatrices A may but need
not be stochastic.

Substochastic models of the form $x(I - A^*) = 0_m$, $x(I - A) = w$, for A any proper
principal submatrix of $A^*$ and w nonnegative, are respectively called closed and open
input-output models. In such models, interest centers on the admissible solutions of the
systems which are regarded as 'activity level' vectors. In fact, such vectors are
expressed in valuation form relative to some base period reference point since any flow
vector $p_w = (p_{n+1} w^*(I - A)^{-1}, p_{n+1})$ of the completion of a nonsingular system $x(I - A) = w$
involves components of the product form $x_j y_j$ (j = 1, ..., n+1), cf. Proposition 3 and
Theorem of Section 3. For models of this form, this fact was known to A. Isnard who
in his Traité des richesses (Paris, 1781) essentially stated this result for a
three-sector normalized resource model [21]. There also exist so-called "valuation"
problems for such substochastic models which can be stated in analogous fashion and
solved as stochastic systems in closure form. It is claimed that the "activity level" and
"valuation" problems cannot be simultaneously solved in the substochastic models; this
is the case, for the two problems as usually stated involve incompatible assumptions,
viz., distinct 'income distributions' by activity or industry of origin. Both the
"activity level" and "valuation" problems can be simultaneously solved in the simple closed
model of production (or of 'dynamic economic equilibrium') in which it is required to
'balance' an indecomposable matrix

I 
oi 

with spectral norm $\rho$ regarded as a 'growth' or 'reproduction' factor (cf. Proposition
3); the vectors x, y of Proposition 3 are respectively called "activity level" and
"valuation" vectors [6]. From the relations given in Proposition 3, it is clear that x, y
and $\rho$ may be computed by means of well established successive approximations
procedures. Immediately related to models of this type are closed 'period planning'
models in which an indecomposable matrix A contains only nontrivial nonnegative
matrices $A_{\alpha\alpha}$ (reflexive terms) and $A_{\alpha, \alpha+1}$ (interaction terms) for $\alpha = 1, \ldots, r$ and also
the interaction term $A_{r1}$; interest centers on the growth factor $\rho$ and the left and right
eigenvectors of A associated with $\rho$.

From the preceding development, it is clear that these particular models and the
issues which give rise to them can be equivalently formulated as balanced margin
matrix problems or as constrained limit (possibly Cesaro limit) distribution problems for
well-defined stationary Markov chains. In particular, consider an "activity level"
problem for a nonsingular system $x(I - A) = w$, w positive and $A \neq 0$. The irreversible
process (or computation) afforded by the Markov chain representation may clearly be
regarded as a master phenomenological equation which entails individual sector "mass
balance" at "equilibrium" and in which expected values are regarded as observables.
In particular, the solution process is given by $z^{(k+1)} - z^{(k)} = z^{(k)}(\tilde{A}_w - I) \equiv \delta^{(k)}$,
$\tilde{A}_w$ the closure of A (k = 0, 1, 2, ...). At "statistical equilibrium", $\delta^{(\infty)} = 0$ and the limit
vector $p_w = (p_{n+1} w^*(I - A)^{-1}, p_{n+1})$ is independent of the initial vector. The scale
factor $\sigma_w = wg$, and the aggregate gross "multiplier" for the system

$M \equiv \left( \sum_{j=1}^{n} x_j + \sigma_w \right) / \sigma_w$

is simply given by the mean recurrence time $1/p_{n+1}$ for
the index or state (n+1) of the closure, where $x = w(I - A)^{-1}$. The scalar relation
$x(I - A)g = wg = \sigma_w \equiv x_{n+1}$ when written in the form

$\sum_{j=1}^{n} (1 - r_j) x_j - x_{n+1} = 0$ ($r_j \equiv \sum_{k=1}^{n} a_{jk}$)

is designated as the "technical production-possibility function" for the 'economy' in
the input-output model since it is of invariant form for any 'bill of goods' or 'bill of
final demand' as w is called in these models.

We continue this treatment of 'input-output' models by considering next a simple
process representation of certain accounting tableaus. We assume given a population,
elements of which are capable of effecting one-step transitions from 'state i' to 'state
j' (i, j = 1, ..., N) in a manner prescribed by a discrete parameter time-homogeneous
Markov chain with transition matrix $P = \| p_{ij} \|$ (i, j = 1, ..., N). To each one-step
transition from 'state i' to 'state j' (i.e., $\alpha_i \to \alpha_j$; $\alpha_i$, $\alpha_j$ vertices in the graph $G(R_P)$)
there is assigned a 'valuation' $c_{ij}$ as given in a real matrix $C = \| c_{ij} \|$ (i, j = 1, ..., N).
It is required to additively evaluate all possible k-step directed paths which have been
traversed by the end of the k'th period (k = 1, 2, ...).

Let $A^{[k]}$ denote the (N x N) 'path valuation' matrix at the end of the k'th period.
Since it is clear that

$A^{[1]} = \| a_{ij}^{[1]} \| = \| c_{ij} p_{ij} \|$,
$A^{[2]} = \| a_{ij}^{[2]} \| = \| \sum_{k=1}^{N} (c_{ik} + c_{kj}) p_{ik} p_{kj} \|$,
$A^{[3]} = \| a_{ij}^{[3]} \| = \| \sum_{h=1}^{N} \sum_{k=1}^{N} (c_{ih} + c_{hk} + c_{kj}) p_{ih} p_{hk} p_{kj} \|$, ...,

the following algorithm may be established for the 'path valuation' computation.

ALGORITHM: $A^{[n]} = A^{[n-1]} P + P^{n-1} A^{[1]}$, n = 2, 3, .... This recursion, in effect,
decomposes the 'path valuation' into two distinct components and evidently requires no
valuation of individual paths. If the 'flow matrix' P were properly substochastic so
that $(I - P)^{-1}$ were to exist, the cumulative 'path valuation' would in the limit be simply
given as

$\lim_{s \to \infty} \sum_{n=1}^{s} A^{[n]} = (I - P)^{-1} A^{[1]} (I - P)^{-1}$.

1 A group-theoretic and stochastic process approach to a related set of resource
models was given by the present writer in an unpublished paper entitled "On Some
Structural Properties of Distributions of Income" (Harvard University, 1947-48), based on
researches in the period 1943-47, conducted at the Division of Statistical Standards,
U.S. Bureau of the Budget, and in other Federal research institutions, and cited in
Studies in Income and Wealth, Volume 13, National Bureau of Economic Research, New
York, 1951, pp. 385-386. Cf. also "The Distribution of Income and Consumer Behavior
Representations" (abstract), Econometrica, Volume 19 (1951), pp. 334-335.

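The preceding algorithm can be concisely stated in the following form. Let T
denote the (2N x 2N) matrix

$T = \begin{Vmatrix} P & 0 \\ A^{[1]} & P \end{Vmatrix}$.

The recursion or 'chain calculation' yields

$T^{n} = \begin{Vmatrix} P^{n} & 0 \\ A^{[n]} & P^{n} \end{Vmatrix}$.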
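
The path-valuation recursion can be checked numerically against direct enumeration of
paths. The Python sketch below is illustrative only: the 3-state matrices P and C are
invented, and the function names `path_valuation` and `brute_force` are hypothetical.

```python
from itertools import product


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def path_valuation(P, C, n):
    """A[n] via the recursion A[k] = A[k-1] P + P^(k-1) A[1], with A[1] = ||c_ij p_ij||."""
    N = len(P)
    A1 = [[C[i][j] * P[i][j] for j in range(N)] for i in range(N)]
    A, P_power = A1, P          # P_power holds P^(k-1) when computing A[k]
    for _ in range(2, n + 1):
        left, right = matmul(A, P), matmul(P_power, A1)
        A = [[left[i][j] + right[i][j] for j in range(N)] for i in range(N)]
        P_power = matmul(P_power, P)
    return A


def brute_force(P, C, n):
    """Sum (total valuation) * (probability) over every n-step directed path."""
    N = len(P)
    A = [[0.0] * N for _ in range(N)]
    for path in product(range(N), repeat=n + 1):
        prob, value = 1.0, 0.0
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]
            value += C[a][b]
        A[path[0]][path[-1]] += value * prob
    return A


P = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.1, 0.0, 0.9]]     # hypothetical chain
C = [[1.0, -2.0, 0.0], [0.5, 3.0, 1.0], [4.0, 0.0, -1.0]]   # hypothetical valuations

recursive = path_valuation(P, C, 3)
enumerated = brute_force(P, C, 3)
gap = max(abs(recursive[i][j] - enumerated[i][j]) for i in range(3) for j in range(3))
print(gap)  # should be at rounding-error level
```

The recursion needs only matrix products, whereas enumeration visits every one of the
N^(n+1) vertex sequences; this is the sense in which no valuation of individual paths is
required.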

The algorithm and recursion are obviously completely general in that they depend in 
no way on the properties of the matrices; thus A and P could be Boolean relation 
matrices and the algebraic rules Boolean in character. In the latter case, the Boolean 

oo N-l h 

relation matrix P = ^ p = p^ t j ie 'ancestral relation 1 of the Boolean 

oo . . m 

relation matrix P (cf. *91 of [26] ); thus in Boolean terms S A 1 J * P A L J P + . 

h -1 
We conclude this approach to 'input-output' formulations by considering a basic
'communication model' [16]. Let R be an (n x n) Boolean relation matrix and
designated as the 'entity communication pattern' for entities placed in correspondence with
the index set 1, ..., n. Let S be a (p x p) Boolean relation matrix and designated as the
'message implication (or communication) relation' for messages (or ideas) placed in
correspondence with the index set 1, ..., p. Let T be a (non-homogeneous) (n x p)
Boolean relation matrix and designated as the 'initial message input tableau' with row
index set referring to the 'entities' and column index set referring to 'messages'.
Consider the 'communication model' defined by the Boolean recursive system



where R is the converse (or transpose) of R. The recursive system specifies the "con- 
figuration state 11 of 'messages' held or known by the set of 'entities' in the course of 
system time. The recursion can be concisely stated in the following form. Let fl} denote 
the (n+pxn+ p) Boolean matrix . The recursion or 'chain calculation' yields 

Ho sli 

IR n U* 



S* 
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, n 2, 3, . . . . 



A classification of all 'communication' structures compatible with this model follows
directly from the results given in [17]; an algorithm for 'minimum transmission
times' of 'messages' to 'entities' can be readily stated [16].

c. Information Logistics. 

We consider finally a simple model of information logistics which belongs to the
mathematical theory of censuses and large-scale sample surveys; it may also be
regarded as a model of 'statistical audit'. In particular, we consider the outcome of
information-processing operations as a topic in the theory of production where the
'output' chain is intrinsically treated as a sequence of classifications through a serial
system of grids. Each processing operation is taken to be basically governed by a
stochastic 'operator' (matrix). The 'output' of processing operation k is regarded as the
'input' to processing operation k+1. The expected 'reliability' or 'reproducibility' of the
intermediate and final products is of interest in this model. 'Reliability' or
'reproducibility' is defined by certain balanced margin properties of expected joint distributions
(or cross-tabulations) comparing the outcomes of first and second 'trial' of a given
processing operation.

We designate the original collection of (census or survey) data as the initial process- 
ing operation and assume there exist a total of r operations (j - 1, . . . , r). The first 
operation may be regarded as the result of interaction between respondents and observ- 
ers, where these may possibly coincide. Let the stochastic 'operator ' for operation 
j (j s 1, . . . , r) be denoted by * j, where * ^ is a row stochastic matrix (generally 
rectangular), with column frame conformal with the row frame of * ^+1 
(h m 1, . . . , r-1). Let Q denote a row stochastic matrix which maps respondent 
'crypto states 1 .,..., P * into response observable s a , . . . , a ; the 'crypto states 1 
are generally unobservable and may not correspond in nature to response observables. 

We assume given an initial distribution h over 'crypto states' for some set of respond- 
ents. We further assume that there exists an (L x L) stochastic matrix Q governing 
transitions between 'crypto states 1 for respondents in an interval between execution and 
repetition of the initial (census or survey) measurements. We prescribe conditional 
independence in all probability calculations. Since each of the processing operations 
is assumed to be repeated in sequence, we have in the first conjoint 'trial' the expecta- 
tion 




r r 

h $ TT * j and in the second conjoint 'trial' the expectation h Q TT * j- 

The expected joint distribution for first and second 'trials' is then given by 

C(h)s HP TT *j T D(10Q 8 IT *j , where D(h) is a diagonal 

L j-l J L j 1 J 

matrix. We define reliability or reproducibility of the information-processing se- 
quence for h as the condition that the marginal distributions of C(h) be equal, i. e. , 

T 
C(h)g * C(h) g so that C(h) must be a balanced margin matrix. In order that C(h) be a 

balanced margin matrix, given any arbitrary initial distribution h, it is necessary and 



sufficient that 

r 
$ 7T 



< 



The basic consequence of the model then is that any expected joint distribution (or
cross-tabulation) C(h) is necessarily symmetric.

6. Conclusion.

In this paper, we have considered the concepts of balance, closure, and interaction
in the context of certain models of complex behavioral systems. In the study of
complex systems, the remarks made by J. W. Gibbs in his address "On Multiple Algebra"
(read before the American Association for the Advancement of Science, Section of
Mathematics and Astronomy, 1886) [8] may be relevant: "In mathematics, a part
often contains the whole".

AN AGGREGATION PROBLEM FOR MARKOV CHAINS
M. Rosenblatt

Introduction. This is an expository paper which has as its aim a brief discussion of
some problems in probability theory that have recently attracted some attention.
Markov chains are among the simplest examples of dependent processes and have been
used in models of random phenomena arising in a variety of fields in the physical
sciences, social sciences and elsewhere. Often one may observe or deal with a function
of the chain rather than the chain itself. This might be because the raw data
are too extensive and so are reduced by lumping or collapsing classes of states for
greater ease in handling. It may also stem from the fact that the process of interest is not
observed directly, but only after it has been passed through some filter (possibly
nonlinear). A natural question is whether this function of the chain is
Markovian itself. For if it were not (at least approximately) one might not be able to
make use of the simple tools relevant when dealing with Markovian processes. In this
paper some simple sufficient conditions are given under which the function of a Markov
chain is itself a Markov chain. For simplicity, the problem is discussed in the context
of a Markov chain rather than that of a general Markov process.

Definitions and Problems. Consider a Markov chain X(n) (see [3]), n = 0, 1, 2, ...,
with a finite number of states 1, 2, ..., m, stationary transition probability (one-step)
matrix

$P = P^{(1)} = (p_{ij};\ i, j = 1, \ldots, m)$

$p_{ij} = P[X(n+1) = j \mid X(n) = i] \geq 0$, $\sum_{j} p_{ij} = 1$

and initial probability vector

$w = (w_i)$.

The Markovian property amounts to

$P[X(n+1) = j \mid X(n) = i, X(n-1) = i_{n-1}, \ldots, X(0) = i_0] = P[X(n+1) = j \mid X(n) = i]$.

Note that this implies that the matrix of n-step transition probabilities

$P^{(n)} = (p_{ij}^{(n)})$, $p_{ij}^{(n)} = P[X(k+n) = j \mid X(k) = i]$

is given by

$P^{(n)} = P^{n}$.

We then automatically have the relation

$P^{(n+k)} = P^{(n)} P^{(k)}$

which is usually referred to as the Chapman-Kolmogorov equation. A natural problem
that arises is when the Chapman-Kolmogorov equation implies the Markovian
property. An example of Paul Lévy that will be given later on indicates that this is not
generally true. Nonetheless we shall see that this does hold under some special
circumstances.

Suppose the experimenter does not observe the process X(n) but rather a derived
process Y(n) = f(X(n)) where f is a given function on 1, ..., m. The states i of the
original process on which f equals some fixed constant are collapsed into a single state of
the new process Y(n). Call these collapsed sets of states $S_i$, i = 1, ..., r, r < m. One
would like to know whether the new process is Markovian. The following simple
example indicates that this is generally not the case. Let the initial process be a Markov
chain with three states 1, 2, 3, transition probability matrix

$P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 1/4 & 5/12 \\ 2/3 & 1/4 & 1/12 \end{pmatrix}$

and initial probability vector

w = (1/3, 1/3, 1/3).

Collapse the states 1, 2 into the set of states S. Then

$P(X(n+2) \in S, X(n+1) \in S \mid X(n) \in S) = 29/96$
$\neq P(X(n+1) \in S \mid X(n) \in S)^2 = (13/24)^2$

so that the new process is not Markovian.
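
The two probabilities in this example can be reproduced with exact rational arithmetic.
The Python sketch below uses the matrix P and vector w given above; the helper name
`prob_stay` and the zero-based state labels are assumptions made for the example. It
relies on the fact that the columns of P each sum to 1, so the uniform vector w is
stationary and X(n) has distribution w for every n.

```python
from fractions import Fraction as F

P = [[F(0), F(1, 2), F(1, 2)],
     [F(1, 3), F(1, 4), F(5, 12)],
     [F(2, 3), F(1, 4), F(1, 12)]]
w = [F(1, 3)] * 3
S = [0, 1]  # states 1 and 2 of the text, written zero-based


def prob_stay(steps):
    """P(X(n+1), ..., X(n+steps) all in S | X(n) in S), with X(n) distributed as w."""
    mass = {i: w[i] for i in S}
    start = sum(mass.values())
    for _ in range(steps):
        # Keep only the probability mass that remains inside S at each step.
        mass = {j: sum(mass[i] * P[i][j] for i in mass) for j in S}
    return sum(mass.values()) / start


two_step = prob_stay(2)
one_step_squared = prob_stay(1) ** 2
print(two_step, one_step_squared)
```

The first value is 29/96 = 174/576 and the second is (13/24)^2 = 169/576, so the
collapsed process fails the product rule a Markov chain on {S, 3} would require.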

An Example of P. Lévy. P. Lévy has given an example of a non-Markovian process
whose transition probabilities satisfy the Chapman-Kolmogorov equation [4]. He has
called such processes pseudo-Markovian. The original process given by Lévy has a
continuous state space but it is very easy to modify his construction and obtain a
process with a discrete state space.

Consider a second order Markov process Y(n) with m states, m > 3, and second
order transition probability

-U, Y(n) 






TS- (2tt 2- u i- u o>rt 



u , u , u. 0, 1, . . . ,m-l. 
o l 6 

Of course, the two-dimensional process

X(n) = (Y(n), Y(n-1))

is a Markov process (first order) in the sense specified in section one. But the
process Y(n), a function of X(n), is not first order Markovian and yet its first order
transition probabilities satisfy the Chapman-Kolmogorov equation if the initial probability
distribution of Y(n) is the uniform distribution P(Y(n) = u) = 1/m. The computations
required to verify this parallel those given in Lévy's note [4].

Discussion of Problems. We shall discuss some of the results that have been ob- 
tained but shall not give proofs. Most of the proofs can be found in [2 ] . 

Often we are interested not in any initial probability distribution w but rather one,
$p = (p_i)$, that is a left invariant vector of the matrix P:

$$pP = p.$$

The process X(n) is then stationary, that is, its probability structure is invariant
under time displacement. Let D be the diagonal matrix with its i-th diagonal entry $p_i$.
The Markov chain X(n) is said to be reversible if

$$DP = P'D$$

(P' is the transpose of P). This means that its probability structure going backwards
in time is the same as that going forwards in time.
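Both conditions are easy to test numerically. The sketch below checks stationarity (the vector p is left invariant under P) and reversibility (entry by entry, $p_i P_{ij} = p_j P_{ji}$). The function names and the symmetric example matrix are illustrative assumptions; the second matrix is the three-state example given earlier.

```python
from fractions import Fraction as F


def is_stationary(p, P):
    """True if p is a left invariant vector of P, i.e. pP = p."""
    m = len(p)
    return all(sum(p[i] * P[i][j] for i in range(m)) == p[j] for j in range(m))


def is_reversible(p, P):
    """True if DP = P'D with D = diag(p), i.e. p_i P_ij = p_j P_ji for all i, j."""
    m = len(p)
    return all(p[i] * P[i][j] == p[j] * P[j][i] for i in range(m) for j in range(m))


uniform = [F(1, 3)] * 3

# A symmetric transition matrix (hypothetical example): reversible under the uniform vector.
P_sym = [[F(1, 2), F(1, 4), F(1, 4)],
         [F(1, 4), F(1, 2), F(1, 4)],
         [F(1, 4), F(1, 4), F(1, 2)]]

# The three-state example of the text: stationary under the uniform vector, not reversible.
P_ex = [[F(0), F(1, 2), F(1, 2)],
        [F(1, 3), F(1, 4), F(5, 12)],
        [F(2, 3), F(1, 4), F(1, 12)]]

print(is_stationary(uniform, P_sym), is_reversible(uniform, P_sym))
print(is_stationary(uniform, P_ex), is_reversible(uniform, P_ex))
```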

Theorem 1. Let X(n) be a stationary reversible Markov chain with $p_i > 0$ for all i.
Then Y(n) = f(X(n)) satisfies the Chapman-Kolmogorov equation if and only if

$$p_{i S_\beta} = P[X(n+1) \in S_\beta \mid X(n) = i] = C_{S_\alpha S_\beta}$$

has the same value for all i in any given collapsed set of states $S_\alpha$, $\alpha = 1, \dots, r$.
In this case the condition that Y(n) satisfy the Chapman-Kolmogorov equation is
equivalent to the condition that Y(n) be Markovian.
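The condition of Theorem 1, that the probability of moving from state i into a collapsed set has the same value for all i in any given collapsed set, can be tested mechanically. The following sketch assumes the collapsed sets are supplied as lists of zero-based state indices; `block_sums_constant` is a hypothetical helper name.

```python
from fractions import Fraction as F


def block_sums_constant(P, blocks):
    """True if, for every pair of collapsed sets (A, B), the probability of
    moving from state i into B is the same for every i in A."""
    for A in blocks:
        for B in blocks:
            sums = {sum(P[i][j] for j in B) for i in A}
            if len(sums) > 1:
                return False
    return True


blocks = [[0, 1], [2]]  # collapse states 1 and 2 of the text into one set

# Symmetric (hence reversible under the uniform vector) hypothetical example.
P_sym = [[F(1, 2), F(1, 4), F(1, 4)],
         [F(1, 4), F(1, 2), F(1, 4)],
         [F(1, 4), F(1, 4), F(1, 2)]]

# The three-state example given earlier.
P_ex = [[F(0), F(1, 2), F(1, 2)],
        [F(1, 3), F(1, 4), F(5, 12)],
        [F(2, 3), F(1, 4), F(1, 12)]]

print(block_sums_constant(P_sym, blocks))  # condition holds
print(block_sums_constant(P_ex, blocks))   # condition fails: 1/2 from state 1, 7/12 from state 2
```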

It would be interesting to obtain a neat set of necessary and sufficient conditions for 
a process Y(n) = f(X(n)), derived from a stationary Markov chain X(n) (not necessarily 
reversible), to be Markovian. 

A somewhat different problem can be phrased in the following way. Let $w = (w_i)$,
$w_i > 0$, i = 1, . . . , m, be any initial probability distribution. Consider the Markov
chain X(n) generated by initial distribution w and transition probability matrix P. Again
consider Y(n) = f(X(n)) and require that Y(n) be Markovian whatever the initial
distribution w.

Theorem 2. Given any set of states $S_\beta$ generated by f, assume there are at least
two states $i \in S_\alpha$, $i' \in S_{\alpha'}$, $\alpha \ne \alpha'$, such that

$$p_{i S_\beta} > 0, \qquad p_{i' S_\beta} > 0.$$

Then a necessary and sufficient condition that Y(n) be Markovian, whatever the initial
distribution w of the Markov chain X(n), is noted below. Given any set of states $S_\beta$
generated by f

[displayed condition not legible in the source]

for all i and $\gamma$.

Examples satisfying the conditions of Theorem 2 but not those of Theorem 1 are gi- 
ven in [2] , [7] . 

We shall now see that the Markov chains X(n) for which f(X(n)) is Markovian, what- 
ever f may be, are of a degenerate form. Of course, a very strong condition is now 
imposed on X(n). 

Theorem 3. Let X(n) be a stationary Markov chain with $p_i > 0$, i = 1, . . . , m. f(X(n))
satisfies the Chapman-Kolmogorov equation for every many-one transformation f if and
only if the transition probability matrix P of X(n) is of the form

$$P = \alpha I + (1 - \alpha)\, U,$$

where I is the identity matrix, U is a matrix with identical rows and $\alpha$ is a real
number. Then f(X(n)) is Markovian for any f.

Let us now consider the case of a decent continuous time parameter Markov chain
with a finite number of states.

Theorem 4. Let X(t), $0 \le t < \infty$, be a Markov chain with a finite number of states
i = 1, . . . , m and stationary transition probability function

$$p_{ij}(t) = P[X(t + \tau) = j \mid X(\tau) = i], \qquad 0 \le t < \infty,$$

continuous in t. Assume that

$$\lim_{t \to 0} P(t) = I.$$

Clearly

$$P(t)P(s) = P(t+s), \qquad t, s > 0.$$

Let the initial probability distribution of X(t) be w, $w_i > 0$, i = 1, . . . , m. Then
Y(t) = f(X(t)) satisfies the Chapman-Kolmogorov equations, whatever the initial
distribution w of X(t), if and only if for each $\beta = 1, \dots, r$ separately either

[the two alternative conditions on $p_{i S_\beta}(t)$ are not legible in the source]

In this case the condition that Y(t) satisfy the Chapman-Kolmogorov equation is
equivalent to the condition that Y(t) be Markovian.

Bush and Mosteller [1 ] considered a problem similar to that of Theorem 3 and ob- 
tained the same result. B. Rankin [5] has discussed the condition of Theorem 1 as a 
sufficient condition for the collapsed f(X(n)) to be Markovian. One should also note that 
the conditions of Theorem 1 and Theorem 3 arise in aggregation problems as they oc- 
cur in economics (see D. Rosenblatt [ 6 ] ). 

This research was supported by the Office of Naval Research. Reproduction in whole 
or in part is permitted for any purpose of the United States Government. 
CODING THEOREMS FOR A DISCRETE SOURCE WITH A FIDELITY CRITERION 1

Claude E. Shannon 

In this paper a study is made of the problem of coding a discrete source of informa- 
tion, given a fidelity criterion or a measure of the distortion of the final recovered 
message at the receiving point relative to the actual transmitted message. In a parti- 
cular case there might be a certain tolerable level of distortion as determined by this 
measure. It is desired to so encode the information that the maximum possible signal- 
ing rate is obtained without exceeding the tolerable distortion level. This work is an 
expansion and detailed elaboration of ideas presented earlier [ 1] , with particular re- 
ference to the discrete case. 

We shall show that for a wide class of distortion measures and discrete sources of 
information there exists a function R(D) (depending on the particular distortion mea- 
sure and source) which measures, in a sense, the equivalent rate R of the source (in 
bits per letter produced) when D is the allowed distortion level. Methods will be given 
for evaluating R(D) explicitly in certain simple cases and for evaluating R(D) by a limi- 
ting process in more complex cases. The basic results are roughly that it is imposs- 
ible to signal at a rate faster than C/R(D) (source letters per second) over a memory- 
less channel of capacity C (bits per second) with a distortion measure less than or 
equal to D. On the other hand, by sufficiently long block codes it is possible to ap- 
proach as closely as desired the rate C/R(D) with distortion level D. 

Finally, some particular examples, using error probability of message letters and 
other simple distortion measures, are worked out in detail.
The Single-Letter Distortion Measure 

Suppose that we have a discrete information source producing a sequence of letters
or "word" $M = m_1, m_2, m_3, \dots, m_t$, each chosen from a finite alphabet. These are to
be transmitted over a channel and reproduced, at least approximately, at a receiving
point. (Here and throughout the paper, the words "reproduced" and "reproduction" do
not imply an exact correct copy but allow the possibility of errors.) Let the reproduced
word be $Z = z_1, z_2, \dots, z_t$. The z letters may be from the same alphabet as the m
letters or from an enlarged alphabet including, perhaps, special symbols for unknown
or semi-unknown letters. In a noisy telegraph situation M and Z might be as follows:
M = I HAVE HEARD THE MERMAIDS SINGING. . .
Z = IH?VTHEA?D TSE B?RMAID2? ??NGING. . .

In this case, the z alphabet consists of the ordinary letters and space of the m alphabet,
together with additional symbols "?", "A'", "B'", etc., indicating less certain
identification. Even more generally, the z alphabet might be entirely different from the m
alphabet.

Consider a situation in which there is a measure of the fidelity of transmission or 
the "distortion" between the original and final words. We shall assume first that this 
distortion measure is of a very simple and special type and later we shall generalize 
considerably on the basis of the special case. 

A single-letter distortion measure is defined as follows. There is given a matrix
$d_{ij}$ with $d_{ij} \ge 0$. Here i ranges over the letters of the m alphabet of, say, a letters
(assumed given a numerical ordering), while j ranges over the z alphabet of b letters.
The quantity $d_{ij}$ may be thought of as a "cost" if letter i is reproduced as letter j.

If the z alphabet includes the m alphabet, we will assume the distortion between an
m letter and its correct reproduction to be zero and all incorrect reproductions to
have positive distortion. It is convenient in this case to assume that the alphabets are
arranged in the same indexing order so that $d_{ii} = 0$, $d_{ij} > 0$ ($i \ne j$).

The distortion D, if word M is reproduced as word Z, is to be measured by the
average of the individual letter distortions:

$$D(M, Z) = \frac{1}{t} \sum_{i=1}^{t} d_{m_i z_i}$$

If in a communication system, word M occurs with probability P(M) and the
conditional probability, if M is transmitted, that word Z will be reproduced is P(Z|M), then
we assume that the over-all distortion of the system is given by

$$D = \sum_{M, Z} P(M)\, P(Z \mid M)\, D(M, Z)$$
Here we are supposing that all messages and reproduced words are of the same length
t. In variable-length coding systems the analogous measure is merely the over-all
probability that letter i is reproduced as j, multiplied by $d_{ij}$ and summed on i and j.
Note that D = 0 if and only if each word is correctly reproduced with probability 1 (in
cases where the z alphabet includes the m alphabet), otherwise D > 0.
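As an illustration of these two definitions, the sketch below computes the word distortion as the average of the individual letter distortions, and the over-all distortion as the sum of P(M) P(Z|M) D(M, Z) over all word pairs. The setting is an assumption made for the example only: binary words of length 2, equiprobable messages, error-probability distortion, and a channel that alters each letter independently with probability 1/10. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def word_distortion(M, Z, d):
    """Average of the individual letter distortions between words M and Z."""
    if len(M) != len(Z):
        raise ValueError("M and Z must have the same length t")
    return sum(d(m, z) for m, z in zip(M, Z)) / len(M)


def overall_distortion(P_M, Z_words, P_Z_given_M, d):
    """Sum over all M and Z of P(M) P(Z|M) D(M, Z)."""
    return sum(P_M[M] * P_Z_given_M(Z, M) * word_distortion(M, Z, d)
               for M in P_M for Z in Z_words)


def hamming(m, z):
    """Error-probability distortion: 0 for a correct letter, 1 otherwise."""
    return Fraction(0) if m == z else Fraction(1)


flip = Fraction(1, 10)  # assumed per-letter error probability


def channel(Z, M):
    """P(Z|M) when each letter is altered independently with probability `flip`."""
    prob = Fraction(1)
    for m, z in zip(M, Z):
        prob *= flip if m != z else 1 - flip
    return prob


words = list(product("01", repeat=2))
P_M = {M: Fraction(1, 4) for M in words}

D = overall_distortion(P_M, words, channel, hamming)
print(D)  # equals the per-letter error probability, 1/10
```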
Some Simple Examples 

A distortion measure may be represented by giving the matrix of its elements, all 
terms of which are non-negative. An alternative representation is in terms of a line 
diagram similar to those used for representing a memoryless noisy channel. The lines 
are now labeled, however, with the values $d_{ij}$ rather than probabilities.

A simple example of a distortion measure, with identical m and z alphabets, is the
error probability per letter. In this case, if the alphabets are ordered similarly,
$d_{ij} = 1 - \delta_{ij}$ ($\delta_{ij} = 1$ if i = j; 0 if $i \ne j$). If there were three letters in the m and z
alphabets, the line diagram would be that shown in Fig. 1(a).

[Fig. 1: line diagrams for (a) the error probability per letter with three letters and (b) the alarm system with inputs "emergency" and "all's well"]

Such a distortion measure might be appropriate in measuring the fidelity of a teletype 
or a remote typesetting system. 

Another example is that of transmitting the quantized position of a wheel or shaft.
Suppose that the circumference is divided into five equal arcs. It might be only half as
costly to have an error of plus or minus one segment as larger errors. Thus the
distortion measure might be

$$d_{ij} = \begin{cases} 0 & i = j \\ 1/2 & |i - j| = 1 \pmod 5 \\ 1 & |i - j| > 1 \pmod 5. \end{cases}$$
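A short sketch of this matrix follows, assuming the five positions are numbered 0 to 4 and that "mod 5" is read as the circular distance between positions; `wheel_distortion` is a hypothetical helper name.

```python
from fractions import Fraction


def wheel_distortion(n=5):
    """Distortion matrix for a wheel with n positions: 0 on the diagonal,
    1/2 for an error of one segment either way, 1 for larger errors."""
    half, one = Fraction(1, 2), Fraction(1)
    d = []
    for i in range(n):
        row = []
        for j in range(n):
            steps = min((i - j) % n, (j - i) % n)  # circular distance
            row.append(Fraction(0) if steps == 0 else half if steps == 1 else one)
        d.append(row)
    return d


d = wheel_distortion()
for row in d:
    print([str(x) for x in row])
```

Every row and every column of this matrix contains the same set of entries, only in a different order.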

A third example might be an alarm system sending information each second, either 
"all's well" or "emergency, " for some situation. Generally, it would be considerably 
more important that the "emergency" signal be correctly received than that the "all's
well" signal be correctly received. Thus if these were weighted 10 to 1, the diagram
would be as shown in Fig. 1(b).

A fourth example with entirely distinct m and z alphabets is a case in which the m
alphabet consists of three possible readings, -1, 0, and +1. Perhaps, for some
reasons of economy, it is desired to work with a reproduced alphabet of two letters,
-1/2 and +1/2. One might then have the matrix that is shown in Fig. 2.

$$\begin{array}{c|cc} & -\tfrac{1}{2} & +\tfrac{1}{2} \\ \hline -1 & 1 & 2 \\ 0 & 1 & 1 \\ +1 & 2 & 1 \end{array}$$

Fig. 2.
The Rate-Distortion Function R(D) 

Now suppose that successive letters of the message are statistically independent but
chosen with the same probabilities, $P_i$ being the probability of letter i from the
alphabet. This type of source we call an independent letter source.

Given such a set of probabilities $P_i$ and a distortion measure $d_{ij}$, we define a
rate-distortion function as follows. Assign an arbitrary set of transition probabilities

$$q_i(j)$$

for transitions from i to j. (Of course, $q_i(j) \ge 0$ and $\sum_j q_i(j) = 1$.) One could calculate
for this assignment two things: first, the distortion measure
$D(q_i(j)) = \sum_{i,j} P_i\, q_i(j)\, d_{ij}$ if letter i were reproduced as j with conditional probability
$q_i(j)$, and, second, the average mutual information between i and j if this were the
case, namely

$$R(q_i(j)) = \sum_{i,j} P_i\, q_i(j) \log \frac{q_i(j)}{\sum_k P_k\, q_k(j)}$$

The rate-distortion function R(D*) is defined as the greatest lower bound of $R(q_i(j))$
when the $q_i(j)$ are varied subject to their probability limitations and subject to the
average distortion D being less than or equal to D*.

Note that $R(q_i(j))$ is a continuous function of the $q_i(j)$ in the allowed region of
variation of $q_i(j)$, which is closed. Consequently, the greatest lower bound of R is actually
attained as a minimum for each value of R that can occur at all. Further, from its
definition it is clear that R(D) is a monotonically decreasing function of D.

The rate distortion function may be thought of as follows. Imagine various possible
memoryless channels from the message alphabet to the recovered letter alphabet. A
particular one of these corresponds to a choice of the $q_i(j)$ and may be called a test
channel. For a particular test channel we calculate the average rate of transmission
(average mutual information) and also the average distortion if this test channel were
used without coding. The rate distortion function R(D) is the minimum rate for all
possible test channels that have an average distortion not exceeding D.
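The two quantities attached to a test channel can be computed directly from the letter probabilities, the transition probabilities and the distortion matrix. The following is a minimal sketch, with the rate measured in bits; the function name and the two example channels are assumptions chosen for illustration. It evaluates one test channel and does not carry out the minimization that defines R(D).

```python
import math


def distortion_and_rate(P, q, d):
    """Average distortion D and average mutual information R (in bits) for
    source probabilities P[i], test channel q[i][j] and distortion matrix d[i][j]."""
    a, b = len(P), len(q[0])
    # Probability of each reproduced letter j under this test channel.
    Q = [sum(P[i] * q[i][j] for i in range(a)) for j in range(b)]
    D = sum(P[i] * q[i][j] * d[i][j] for i in range(a) for j in range(b))
    R = sum(P[i] * q[i][j] * math.log2(q[i][j] / Q[j])
            for i in range(a) for j in range(b) if P[i] * q[i][j] > 0)
    return D, R


P = [0.5, 0.5]                       # equiprobable binary source
d = [[0, 1], [1, 0]]                 # error-probability distortion
symmetric = [[0.9, 0.1], [0.1, 0.9]]  # test channel with 10 percent errors
useless = [[0.3, 0.7], [0.3, 0.7]]    # output independent of input

D1, R1 = distortion_and_rate(P, symmetric, d)
D2, R2 = distortion_and_rate(P, useless, d)
print(D1, R1)   # distortion 0.1, rate about 0.531 bits per letter
print(D2, R2)   # distortion 0.5, rate 0
```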
Convexity of the R(D) Curve 

Suppose that two points on the R(D) curve (that is, the curve R(D) plotted as a
function of D) are (R, D) obtained with assignment $q_i(j)$ and (R', D') attained with
assignment $q_i'(j)$. Consider a mixture of these assignments
$q_i''(j) = \lambda q_i(j) + (1 - \lambda) q_i'(j)$ with $0 \le \lambda \le 1$.
This produces a D'' (because of the linearity of D) equal to $\lambda D + (1 - \lambda) D'$. On the
other hand, $R(q_i(j))$ is known to be a convex downward function (the rate or mutual
information for a channel as a function of its transition probabilities). Hence
$R'' \le \lambda R + (1 - \lambda) R'$. The minimizing $q_i''(j)$ for D'' must give at least this low a value
of R''. Hence the curve R as a function of D is convex downward. Conversely, because
of its monotonicity, D as a function of R is convex downward.

The minimum possible D value clearly occurs if, for each i, $q_i(j)$ is assigned the
value 1 for the j having the minimum $d_{ij}$. Thus, the lowest possible D is given by

$$D_{\min} = \sum_i P_i \min_j d_{ij}$$

If the m alphabet is imaged in the z alphabet, then $D_{\min} = 0$, and the corresponding R
value is the ordinary entropy or rate for the source. In the more general situation,
$R(D_{\min})$ may be readily evaluated by evaluating R for the assignment mentioned, if
there is a unique $\min_j d_{ij}$. Otherwise, the evaluation of $R(D_{\min})$ is a bit more complex.

On the other hand, R = 0 is obtained if and only if $q_i(j) = Q_j$, a function of j only.
This is because an average mutual information is positive unless the events are
independent. For a given $Q_j$ giving R = 0, the D is then
$\sum_{i,j} P_i\, Q_j\, d_{ij} = \sum_j Q_j \sum_i P_i\, d_{ij}$.

The minimum D for R = 0 clearly would result from finding a j that gives a minimum
$\sum_i P_i\, d_{ij}$ (say j*) and making $Q_{j^*} = 1$. This can be done by assigning $q_i(j^*) = 1$ (all
other $q_i(j)$ are made 0).

Summarizing, then, R(D) is a convex downward function as shown in Fig. 3 running
from $R(D_{\min})$ at $D_{\min} = \sum_i P_i \min_j d_{ij}$ to zero at $D_{\max} = \min_j \sum_i P_i\, d_{ij}$. It is
continuous (R a function of D or D a function of R) interior to this interval because of its
convexity. For $D \ge D_{\max}$, we have R = 0. The curve is strictly monotonically
decreasing from $D_{\min}$ to $D_{\max}$. Also, it is easily seen that in this interval the
assignment of $q_i(j)$ to obtain any point R(D*) must give a D satisfying the equality D = D*
(not the inequality D < D*). (For $D^* > D_{\max}$ the inequality will occur for the
minimizing $q_i(j)$.) Thus the minimizing problem can be limited to a consideration of minima
in the subspace where D = D*, except in the range $D^* > D_{\max}$ (where R(D*) = 0).

[Fig. 3: the curve R(D), running from $R(D_{\min})$ at $D_{\min} = \sum_i P_i \min_j d_{ij}$ to zero at $D_{\max} = \min_j \sum_i P_i\, d_{ij}$]

The convex downward nature of R as a function of the assigned $q_i(j)$ is helpful in
evaluating the R(D) in specific cases. It implies that any local minimum (in the sub- 
space for a fixed D) is the absolute minimum in this subspace. For otherwise we 
could connect the local and absolute minima by a straight line and find a continuous 
series of points lower than the local minimum along this line. This would contradict 
its being a local minimum. 

Furthermore, the functions $R(q_i(j))$ and $D(q_i(j))$ have continuous derivatives interior
to the allowed $q_i(j)$ set. Hence ordinary calculus methods (e.g., Lagrangian
multipliers) may be used to locate the minimum. In general, however, this still involves
the solution of a set of simultaneous equations.
Solution for R(D) in Certain Simple Cases 

One special type of situation leads to a simple explicit solution for the R(D) curve. 
Suppose that all a input letters are equiprobable: $P_i = 1/a$. Suppose further that the
$d_{ij}$ matrix is square and is such that each row has the same set of entries and each
column also has the same set of entries, although, in general, in different order.

An example of this type is the positioning of a wheel mentioned earlier if all
positions are equally likely. Another example is the simple error probability distortion
measure if all letters are equally likely.

In general, let the entries in any row or column be $d_1, d_2, d_3, \dots, d_a$. Then we shall
show that the minimizing R for a given D occurs when all lines with distortion
assignment $d_i$ are given the probability assignment

$$q_i = \frac{e^{-p d_i}}{\sum_j e^{-p d_j}}$$

Here p is a parameter ranging from 0 to $\infty$ which determines the value of D. With
this minimizing assignment, D and R are given parametrically in terms of p:

$$D = \frac{\sum_i d_i\, e^{-p d_i}}{\sum_i e^{-p d_i}}$$

$$R = \log \frac{a}{\sum_i e^{-p d_i}} - pD$$

When p = 0 it may be seen that $D = \frac{1}{a} \sum_i d_i$ and R = 0. When $p \to \infty$, $D \to d_{\min}$
and $R \to \log \frac{a}{k}$, where k is the number of $d_i$ with value $d_{\min}$.
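The parametric solution can be traced numerically. The sketch below assumes natural logarithms (so that R is in natural units and the term pD is dimensionally consistent) and takes one row of the distortion matrix as input; `parametric_point` is a hypothetical name. It uses the three-letter error-probability measure as the example row.

```python
import math


def parametric_point(d_row, p):
    """Return (q, D, R) for parameter p, where d_row lists the entries
    d_1, ..., d_a of any row of the distortion matrix. R is in natural units."""
    weights = [math.exp(-p * di) for di in d_row]
    total = sum(weights)
    q = [w / total for w in weights]
    D = sum(qi * di for qi, di in zip(q, d_row))
    R = math.log(len(d_row) / total) - p * D
    return q, D, R


row = [0.0, 1.0, 1.0]  # error probability per letter, three equiprobable letters

for p in (0.0, 0.5, 1.3, 5.0, 50.0):
    q, D, R = parametric_point(row, p)
    print(f"p = {p:5.1f}   D = {D:.6f}   R = {R:.6f}")
```

At p = 0 the point is D = 2/3, R = 0, and for large p the point approaches D = 0, R = log 3, in line with the limiting values stated above (here k = 1).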

This solution is proved as follows. Suppose that we have an assignment $q_i(j)$ giving
a certain average distortion and mutual information, say D* and R*. Consider now a
new assignment where each line with $d_{ij}$ value $d_1$ is assigned the average of the
assignments for these lines in the original assignment. Similarly, each line labeled
$d_2$ is given the average of all the $d_2$ original assignments, and so on. Because of the
linearity of D, this new assignment has the same D value, namely D*. The new R is
the same as or smaller than R*. This is shown as follows. R may be written in terms
of entropies as H(m) - H(m|z). H(m) is not changed, and H(m|z) can only be increased
by this averaging. The latter fact can be seen by observing that because of the
convexity of $-\sum_j x_j \log x_j$ we have

$$-\sum_j \sum_t \alpha_t\, x_j^{(t)} \log x_j^{(t)} \le -\sum_j \Big(\sum_t \alpha_t\, x_j^{(t)}\Big) \log \sum_t \alpha_t\, x_j^{(t)}$$

where for a given t, $x_j^{(t)}$ is a set of probabilities, and $\alpha_t$ is a set of weighting factors.

In particular

[displayed inequality not legible in the source]

where $q_j^{(s)}$ is the original assignment to the line of value $d_j$ from letter s. But this
inequality can be interpreted on the left as H(m|z) after the averaging process, while
the right-hand side is H(m|z) before the averaging. The desired result then follows.

Hence, for the minimizing assignment, all lines with the same d value will have
equal probabilities. We denote these probabilities by $q_i$, each $q_i$ corresponding to a
line labeled $d_i$. The rate R and distortion D can now be written

$$D = \sum_i q_i\, d_i$$

$$R = \log a + \sum_i q_i \log q_i$$

since all received letters are now equiprobable, and $H(m) = \log a$,
$H(m \mid z) = -\sum_i q_i \log q_i$.

We wish, by proper choice of the $q_i$, to minimize R for a given D and subject to
$\sum_i q_i = 1$. Using Lagrangian multipliers,

$$U = \log a + \sum_i q_i \log q_i + p \sum_i q_i\, d_i + \mu \sum_i q_i$$

$$\frac{\partial U}{\partial q_i} = 1 + \log q_i + p\, d_i + \mu = 0$$

which yields the solution

$$q_i = A\, e^{-p d_i}$$

where A must be chosen to satisfy the side condition $\sum_i q_i = 1$. This requires

$$q_i = \frac{e^{-p d_i}}{\sum_j e^{-p d_j}}$$

We then have a stationary point and by the convexity properties mentioned above it 
must be the absolute minimum for the corresponding value of D. By substituting this 
probability assignment in the formulas for D and R we obtain the results stated above. 
Rate for a Product Source with a Sum Distortion Measure

Suppose that we have two independent sources with respective distortion measures
$d_{ij}$ and $d'_{i'j'}$, and resulting rate distortion functions $R_1(D_1)$ and $R_2(D_2)$. Suppose that
each source produces one letter each second. Considering ordered pairs of letters as
single letters the combined system may be called the product source. If the total
distortion is to be measured by the sum of the individual distortions, $D = D_1 + D_2$, then
there is a simple method of determining the function R(D) for the product source. In
fact, we shall show that each coordinate of R(D) is obtained by adding the respective
coordinates of the curves $R_1(D_1)$ and $R_2(D_2)$ at points on the two curves having the
same slope. The set of points obtained in this manner is the curve R(D). Furthermore,
a probability assignment to obtain any point of R(D) is the product of the
assignments for the component points.

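We shall first show that given any assignment $q_{i,i'}(j, j')$ for the product source, we
can do at least as well in the minimizing process using an assignment of the form
$q_i(j)\, q'_{i'}(j')$ where the q and q' are derived from the given $q_{i,i'}(j, j')$. Namely, let

$$q_i(j) = \sum_{i', j'} P'_{i'}\, q_{i,i'}(j, j'), \qquad q'_{i'}(j') = \sum_{i, j} P_i\, q_{i,i'}(j, j')$$

These are non-negative and, summed on j and j', respectively, give 1, so they are
satisfactory transition probabilities. Also the assignment $q_i(j)\, q'_{i'}(j')$ gives the same
total distortion as the assignment $q_{i,i'}(j, j')$. The former is

$$\sum_{i,i',j,j'} P_i\, P'_{i'}\, q_i(j)\, q'_{i'}(j')\, (d_{ij} + d'_{i'j'})
= \sum_{i,j} P_i\, q_i(j)\, d_{ij} + \sum_{i',j'} P'_{i'}\, q'_{i'}(j')\, d'_{i'j'}
= \sum_{i,i',j,j'} P_i\, P'_{i'}\, q_{i,i'}(j, j')\, (d_{ij} + d'_{i'j'})$$

which is the distortion given by the assignment $q_{i,i'}(j, j')$.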

On the other hand, the mutual information R is decreased or left constant if we use
$q_i(j)\, q'_{i'}(j')$ instead of $q_{i,i'}(j, j')$. In fact, this average mutual information can be
written in terms of entropies as follows (using asterisks for entropies with the assignment
$q_i(j)\, q'_{i'}(j')$ and none for the assignment $q_{i,i'}(j, j')$). We have

$$\begin{aligned}
R &= H(i, i') - H(i, i' \mid j, j') \\
&\ge H(i, i') - H(i \mid j, j') - H(i' \mid j, j') \\
&\ge H(i, i') - H(i \mid j) - H(i' \mid j')
\end{aligned}$$

Here we use the fact that with our definition of $q_i(j)$ and $q'_{i'}(j')$ we have $P^*(i \mid j) = P(i \mid j)$
and $P^*(i' \mid j') = P(i' \mid j')$. (This follows immediately on writing out these probabilities.)
Now, using the fact that the sources are independent,
$H(i, i') = H(i) + H(i') = H^*(i) + H^*(i')$.
Hence our last reduction above is equal to R*. This is the desired conclusion.

It follows that any point on the R(D) curve for the product source is obtained by an
independent or product assignment $q_i(j)\, q'_{i'}(j')$, and consequently (since with independent
assignments both R and D are additive) is the sum in both coordinates of a pair of
points on the two curves. The best choice for a given distortion D is clearly given by

$$R(D) = \min_t\, [R_1(t) + R_2(D - t)]$$

and this minimum will occur when

$$\frac{d}{dt} R_1(t) = \left.\frac{d}{ds} R_2(s)\right|_{s = D - t}$$

Thus the component points to be added are points where the component curves have the 
same slope. The convexity of these curves insures the uniqueness of this pair for any 
particular D. 
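The minimization over t can be illustrated with a simple grid search. The sketch assumes two identical component sources, each an equiprobable binary source with the error-probability distortion measure, whose rate distortion function in bits is taken here as 1 - H(D) for D between 0 and 1/2, with H the binary entropy function. All function names are illustrative. By symmetry the equal-slope condition is then met when the distortion is split evenly between the two components.

```python
import math


def r_binary(D):
    """Assumed component curve: R(D) = 1 - H(D) bits for 0 < D < 1/2."""
    if D <= 0.0:
        return 1.0
    if D >= 0.5:
        return 0.0
    return 1.0 + D * math.log2(D) + (1.0 - D) * math.log2(1.0 - D)


def product_rate(R1, R2, D, steps=2000):
    """Grid search for min over t of R1(t) + R2(D - t); returns (minimum, best t)."""
    best, best_t = math.inf, None
    for k in range(steps + 1):
        t = D * k / steps
        value = R1(t) + R2(D - t)
        if value < best:
            best, best_t = value, t
    return best, best_t


rate, split = product_rate(r_binary, r_binary, 0.2)
print(rate, split)  # the minimum is found at an even split of the distortion
```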
The Lower Bound on Distortion for a Given Channel Capacity 

The importance of the R(D) function is that it determines the channel capacity re- 
quired to send at a certain rate and with a certain minimum distortion. Consider the 
following situation. We are given an independent letter source with probabilities $P_i$
for the different possible letters. We are given a single-letter distortion measure $d_{ij}$
which leads to the rate distortion function R(D). Finally, there is a memoryless dis- 
crete channel K of capacity C bits per channel, or for each usage of the channel. We 
wish to transmit words of length t, that is, containing t letters, from the source over 
the channel with a block code. The length of the code words in the channel is n, (that 
is, n channel letters). What is the lowest distortion D that might be obtained with a 
code and a decoding system of this sort? 

Theorem 1. Under the assumptions given above it is not possible to have a code with
distortion D smaller than the (minimum) D* satisfying

$$R(D^*) = \frac{n}{t}\, C$$

or, equivalently, in any code, $D \ge R^{-1}\left(\frac{n}{t}\, C\right)$, where $R^{-1}$ is the function inverse to
R (take $R^{-1}(0)$ to be $D_{\max}$).

This theorem, and a converse positive result to be given later, show that R(D) may 
be thought of as the equivalent rate of the source for a given distortion D. Theorem 1 
asserts that for the distortion D and t letters of text, one must supply in the channel at 
least t R(D) total bits of capacity spread over the n uses of the channel in the code. 
The converse theorem will show that by taking n and t sufficiently large and with suit- 
able codes it is possible to approach this limiting curve. 

To prove Theorem 1, suppose that we have given a block code which encodes all
message words of length t into channel words of length n and a decoding procedure for
interpreting channel output words of length n into reproduced Z words of length t. Let
a message word be represented by $M = m_1, m_2, \dots, m_t$. A channel input word is
$X = x_1, x_2, \dots, x_n$. A channel output word is $Y = y_1, y_2, \dots, y_n$ and a reproduced, or Z,
word is $Z = z_1, z_2, \dots, z_t$. By the given code and decoding system, X is a function of
M and Z is a function of Y. The $m_i$ are chosen independently according to the letter
probabilities, and the channel transition probabilities give a set of conditional
probabilities P(y|x) applying to each $x_i$, $y_i$ pair. Finally, the source and channel are
independent in the sense that P(Y|M, X) = P(Y|X).

We first derive the known result that $H(M \mid Z) \ge H(M) - nC$. We have that
$H(M \mid Z) \ge H(M \mid Y)$ (since Z is a function of Y) and also that
$H(M \mid Y) = H(X \mid Y) - H(X) + H(M)$. This last is because, from the independence
condition above, H(Y|M, X) = H(Y|X), so H(Y, M, X) - H(M, X) = H(X, Y) - H(X). But
H(M, X) = H(M), since X is a function of M, and for the same reason
H(M, X, Y) = H(M, Y). Hence, rearranging, we have

$$\begin{aligned}
H(X, Y) &= H(M, Y) + H(X) - H(M, X) \\
&= H(M, Y) + H(X) - H(M) \\
H(X \mid Y) &= H(M \mid Y) + H(X) - H(M)
\end{aligned}$$

Here we used H(M, X) = H(M) and then subtracted H(Y) from each side. Hence
$H(M \mid Z) \ge H(X \mid Y) - H(X) + H(M)$.

Now we show that $H(X \mid Y) \ge H(X) - nC$. This follows from a method we have used in
other similar situations, by considering the change in H(X|Y) with each received
letter. Thus (using $Y_k$ for the first k y-letters, etc.),

$$\begin{aligned}
\Delta H(X \mid Y) &= H(X \mid y_1 \cdots y_k) - H(X \mid y_1 \cdots y_{k+1}) \\
&= H(X, Y_k) - H(Y_k) - H(X, Y_k, y_{k+1}) + H(Y_k, y_{k+1}) \\
&= H(y_{k+1} \mid Y_k) - H(y_{k+1} \mid X, Y_k) \\
&= H(y_{k+1} \mid Y_k) - H(y_{k+1} \mid x_{k+1}) \\
&\le H(y_{k+1}) - H(y_{k+1} \mid x_{k+1}) \\
&\le C
\end{aligned}$$

Here we used, to obtain the fourth line, the fact that the channel is memoryless, so
$P(y_{k+1} \mid X, Y_k) = P(y_{k+1} \mid x_{k+1})$. Therefore $H(y_{k+1} \mid X, Y_k) = H(y_{k+1} \mid x_{k+1})$.
Finally, C is the maximum possible H(y) - H(y|x), giving the last inequality.

Since the incremental change in $H(X \mid Y_k)$ is bounded by C, the total change after n
steps is bounded by nC. Consequently, the final H(X|Y) is at least the initial value
H(X) less nC.

$$\begin{aligned}
H(M \mid Z) &\ge H(X \mid Y) - H(X) + H(M) \\
&\ge H(X) - nC - H(X) + H(M) \\
H(M \mid Z) &\ge H(M) - nC
\end{aligned}$$

We now wish to overbound H(M|Z) in terms of the distortion D. We have

$$\begin{aligned}
H(M \mid Z) &= H(m_1 m_2 \cdots m_t \mid z_1 z_2 \cdots z_t) \\
&\le \sum_i H(m_i \mid z_i) \\
&= \sum_i H(m_i) - \sum_i \big(H(m_i) - H(m_i \mid z_i)\big)
\end{aligned}$$

Here we used the fact that additional conditioning variables decrease entropy and also
the entropy of a joint event is as large as that of any one of the events. The quantity
$H(m_i) - H(m_i \mid z_i)$ is the average mutual information between original message letter
$m_i$ and reproduced letter $z_i$. If we let $D_i$ be the distortion between these letters, then
$R(D_i)$ (the rate-distortion function evaluated for this $D_i$) satisfies

$$R(D_i) \le H(m_i) - H(m_i \mid z_i)$$

since $R(D_i)$ is the minimum mutual information for the distortion $D_i$ with any
probability assignment. Hence our inequality may be written

$$H(M \mid Z) \le \sum_{i=1}^{t} H(m_i) - \sum_{i=1}^{t} R(D_i)$$

Using now the fact that R(D) is a convex downward function, we have

$$H(M \mid Z) \le \sum_i H(m_i) - t\, R\Big(\sum_i \frac{D_i}{t}\Big)$$

But $\sum_i \frac{D_i}{t} = D$, the over-all distortion of the system, so

$$H(M \mid Z) \le \sum_i H(m_i) - t\, R(D)$$

Combining this with our previous lower bound and using the independent letter
assumption, so $H(M) = \sum_i H(m_i)$, we have

$$\begin{aligned}
H(M) - nC &\le H(M) - t\, R(D) \\
nC &\ge t\, R(D)
\end{aligned}$$

Since R(D) is monotone decreasing, this requires that $D \ge D^*$, the value resulting in
equality $nC = t\, R(D^*)$. This is essentially the result stated in Theorem 1.

It should be noted that the result in the theorem is an assertion about the minimum 
distortion after any finite number n of uses of the channel. It is not an asymptotic re- 
sult for large n. Also, it applies, as seen by the method of proof, for any code, block 
or variable length, provided only that after n uses of the channel, t (or more) letters 
are produced at the receiving point, whatever the received sequence may be. 
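As a worked instance of this lower bound, the sketch below solves R(D*) = nC/t by bisection for one particular source. The source is an assumption made for the example: an equiprobable binary source with the error-probability distortion measure, whose rate distortion function in bits is taken as 1 - H(D) for D between 0 and 1/2 (H the binary entropy function) and zero beyond. The function names are illustrative.

```python
import math


def binary_rate_distortion(D):
    """Assumed R(D) in bits for an equiprobable binary source, error-probability distortion."""
    if D <= 0.0:
        return 1.0
    if D >= 0.5:
        return 0.0
    return 1.0 + D * math.log2(D) + (1.0 - D) * math.log2(1.0 - D)


def min_distortion(n, t, C, tol=1e-12):
    """Smallest D* with R(D*) = n*C/t, found by bisection on the decreasing curve R(D).
    n: channel uses, t: source letters, C: capacity in bits per channel use."""
    target = n * C / t
    if target >= 1.0:   # enough capacity for exact reproduction
        return 0.0
    if target <= 0.0:   # no capacity: D_max for this source
        return 0.5
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_rate_distortion(mid) > target:
            lo = mid    # rate still too high, so D* lies to the right
        else:
            hi = mid
    return hi


# One channel use of capacity 1 bit for every two source letters.
D_star = min_distortion(n=1, t=2, C=1.0)
print(D_star)  # roughly 0.11: no code of this kind can have a lower error rate
```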
The Coding Theorem for a Single- Letter Distortion Measure 

We now prove a positive coding theorem corresponding to the negative statements of 
Theorem 1; namely, that it is possible to approach the lower bound of distortion for a 
given ratio of number n of channel letters to number t of message letters. We consider 
then a source of message letters and a single-letter distortion measure $d_{ij}$. More
generally than Theorem 1, however, this source may be any ergodic source; it is not 
necessarily an independent letter source. This more general situation will be helpful 
in a later generalization of the theorem. For an ergodic source there will still, of 
course, be letter probabilities $P_i$, and we could determine a rate distortion function
R(D) based on these probabilities as though it were an independent letter source. 

We first establish the following result. 

Lemma 1. Suppose that we have an ergodic source with letter probabilities $P_i$, a
single-letter distortion measure $d_{ij}$, and a set of assigned transition probabilities $q_i(j)$
such that

$$\sum_{i,j} P_i\, q_i(j)\, d_{ij} = D^*$$

$$\sum_{i,j} P_i\, q_i(j) \log \frac{q_i(j)}{\sum_k P_k\, q_k(j)} = R$$

Let Q(Z) be the probability measure of a sequence Z in the space of reproduced
sequences of length t if successive source letters had independent transition probabilities
$q_i(j)$ into the Z alphabet. Then, given any $\epsilon > 0$, for all sufficiently large block
length t, there exists a set $\alpha$ of messages of length t from the source with total source
probability $P(\alpha) \ge 1 - \epsilon$, and for each M belonging to $\alpha$ a set of Z sequences of
length t, say $\beta_M$, such that

1) $D(M, Z) \le D^* + \epsilon$ for $M \in \alpha$ and $Z \in \beta_M$

2) $Q(\beta_M) \ge e^{-t(R + \epsilon)}$ for any $M \in \alpha$
M 
In other words, and somewhat roughly, long messages will, with high probability,
fall in a certain subset $\alpha$. Each member M of this subset has an associated set of Z
sequences $\beta_M$. The members of $\beta_M$ have only (at most) slightly more than D*
distortion with M and the logarithm of the total probability of $\beta_M$ in the Q measure is
almost -tR.

To prove the lemma, consider source blocks of length t and the Z blocks of length t.
Consider the two random variables, the distortion D between an M block and a Z block
and the (unaveraged) mutual information I(M, Z) given by:

$$D = \frac{1}{t} \sum_i d_{m_i z_i}$$

$$I(M, Z) = \frac{1}{t} \log \frac{P(Z \mid M)}{Q(Z)} = \frac{1}{t} \sum_i \log \frac{P(z_i \mid m_i)}{Q(z_i)}$$

Here $m_i$ is the i-th letter of a source block M, and $z_i$ is the i-th letter of a Z block.
Both I and D are random variables, taking on different values corresponding to
different choices of M and Z. They are both the sum of t random variables which are
identical functions of the joint M, Z process except for shifting along over t positions.

Since the joint process is ergodic, we may apply the ergodic theorem and assert
that when t is large, D and I will, with probability nearly 1, be close to their expected
values. In particular, for any given $\epsilon_1$ and $\delta$, if t is sufficiently large, we will have
with probability at least $1 - \delta^2/2$ that

$$D \le \sum_{i,j} P_i\, q_i(j)\, d_{ij} + \epsilon_1 = D^* + \epsilon_1.$$

Also, with probability at least $1 - \delta^2/2$ we will have

$$I \le R + \epsilon_1.$$

Let $\gamma$ be the set of (M, Z) pairs for which both inequalities hold. Then $P(\gamma) \ge 1 - \delta^2$
because each of the conditions can exclude, at most, a set of probability $\delta^2/2$. Now
for any $M_1$ define $\beta_{M_1}$ as the set of Z such that $(M_1, Z)$ belongs to $\gamma$.

We have

$$P(\beta_M \mid M) \ge 1 - \delta$$

on a set $\alpha$ of M whose total probability satisfies $P(\alpha) \ge 1 - \delta$. This is true, since if it
were not we would have a total probability in the set complementary to $\gamma$ of at least
$\delta \cdot \delta = \delta^2$, a contradiction. The first $\delta$ would be the probability of M not being in $\alpha$,
and the second $\delta$ the conditional probability for such M's of Z not being in $\beta_M$. The
product gives a lower bound on the probability of the complementary set to $\gamma$.




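If $Z \in \beta_{M_1}$, then

$$\frac{1}{t} \log \frac{P(Z \mid M_1)}{Q(Z)} \le R + \epsilon_1,$$

$$P(Z \mid M_1) \le Q(Z)\, e^{t(R + \epsilon_1)},$$

$$Q(Z) \ge P(Z \mid M_1)\, e^{-t(R + \epsilon_1)}.$$

Sum this inequality over all $Z \in \beta_{M_1}$:

$$Q(\beta_{M_1}) = \sum_{Z \in \beta_{M_1}} Q(Z) \ge e^{-t(R + \epsilon_1)} \sum_{Z \in \beta_{M_1}} P(Z \mid M_1).$$

If $M_1 \in \alpha$ then $\sum_{Z \in \beta_{M_1}} P(Z \mid M_1) \ge 1 - \delta$, as seen above. Hence the inequality can be
continued to give

$$Q(\beta_{M_1}) \ge (1 - \delta)\, e^{-t(R + \epsilon_1)}, \qquad M_1 \in \alpha.$$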

We have now established that for any $\epsilon_1 > 0$ and $\delta > 0$ there exists (for sufficiently
large block length t) a set $\alpha$ of M's and sets $\beta_M$ of Z's defined for each M in $\alpha$ with
the three properties

1) $\Pr(\alpha) \ge 1 - \delta$,

2) $D(Z, M) \le D^* + \epsilon_1$ if $M \in \alpha$, $Z \in \beta_M$,

3) $Q(\beta_M) \ge (1 - \delta)\, e^{-t(R + \epsilon_1)}$ if $M \in \alpha$,

provided that the block length t is sufficiently large. Clearly, this implies that for any
$\epsilon > 0$ and sufficiently large t we will have

1) $\Pr(\alpha) \ge 1 - \epsilon$,

2) $D(Z, M) \le D^* + \epsilon$ if $M \in \alpha$, $Z \in \beta_M$,

3) $Q(\beta_M) \ge e^{-t(R + \epsilon)}$ if $M \in \alpha$,

since we may take the $\epsilon_1$ and $\delta$ sufficiently small to satisfy these simplified conditions
in which we use the same $\epsilon$. This concludes the proof of the lemma.
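
The concentration step of this proof can be checked numerically in the simplest
setting. The sketch below is an illustration only and is not part of the original
argument: it assumes an independent-letter source whose letter distortions are
independent 0/1 variables with mean D* (for instance a binary source reproduced
through a symmetric assignment with crossover probability D*), so that the number of
letter errors in a block of length t is binomial. The function name and the parameter
values are chosen here for illustration.

```python
import math


def prob_distortion_within(t, d_star, eps):
    """Exact P(D <= d_star + eps) for a block of length t, assuming the t
    letter distortions are independent 0/1 variables with mean d_star."""
    k_max = math.floor((d_star + eps) * t)
    total = 0.0
    for k in range(k_max + 1):
        # log of C(t, k) * d_star**k * (1 - d_star)**(t - k); working in logs
        # avoids overflow of the binomial coefficient for large t
        log_term = (math.lgamma(t + 1) - math.lgamma(k + 1) - math.lgamma(t - k + 1)
                    + k * math.log(d_star) + (t - k) * math.log(1.0 - d_star))
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    for t in (50, 200, 1000, 5000):
        print(t, prob_distortion_within(t, 0.1, 0.02))
```

Under these assumptions the printed probabilities increase toward 1 as t grows, which
is the behavior the ergodic theorem is invoked to guarantee in the general case.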

Before attacking the general coding problem, we consider the problem indicated 
schematically in Fig. 4. 



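[Fig. 4. Block diagram: an ergodic source with rate distortion function R(D), followed by
a coder. Labeled signals: M (message), X (coded sequence), Z (reproduced message).
Annotations: H(X) ≤ R(D*) + ε; average distortion with M is ≤ D*.]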

We have an ergodic source and a single-letter distortion measure that gives the rate 
distortion function based on single letter probabilities R(D). It is desired to encode 
this source by a coder into sequences X in such a way that the original messages can be 
reproduced by the reproducer with an average distortion that does not exceed D* (D* be- 
ing some fixed tolerable distortion level). We are considering here block coding de- 
vices for both boxes. Thus the coder takes as input successive blocks of length t pro- 
duced by the source and has, as output, corresponding to each possible M block, a 
block from an X alphabet. 

The aim here is to do the coding in such a way as to keep the entropy of the X sequen- 
ces as low as possible, subject to this requirement of reproducibility with distortion D* 
or less. Here the entropy to which we are referring is the entropy per letter of the ori- 
ginal source. (Alternatively, we might think of the source as producing one letter per 
second and we are then interested in the X entropy per second.)

We shall show that for any D* and any $\epsilon > 0$ coders and reproducers can be found that
are such that $H(X) \le R(D^*) + \epsilon$. As $\epsilon \to 0$ the block length involved in the code in general
increases. This result, of course, is closely related to our interpretation of R(D*) as
the equivalent rate of the source for distortion D*. It will follow readily from the
following theorem.

Theorem 2. Given an ergodic source, a distortion measure $d_{ij}$, and rate distortion
function R(D) (based on the single-letter frequencies of the source), given $D^* \ge D_{\min}$
and $\delta > 0$, then for any sufficiently large t there exists a set A containing N words of
length t in the Z alphabet with the following properties:

1) $\frac{1}{t} \log N \le R(D^*) + \delta$.

2) The average distortion between an M word of length t and its nearest
(i.e., least distortion) word in the set A is less than or equal to $D^* + \delta$.

This theorem implies (except for the $\delta$ in property (2), which will later be eliminated)
the results mentioned above. Namely, for the coder, one merely uses a device that
maps any M word into its nearest member of A. The reproducer is then merely an
identity transformation. The entropy per source letter of the coded sequence cannot
exceed $R(D^*) + \delta$, since this would be maximized at $\frac{1}{t} \log N$ if all of the N members of
A were equally probable, and $\frac{1}{t} \log N$ is by the theorem less than or equal to
$R(D^*) + \delta$.

This theorem will be proved by a random coding argument. We shall consider an en- 
semble of ways of selecting the members of A and estimate the average distortion for 
this ensemble. From the bounds on the average it will follow that at least one code ex- 
ists in the ensemble with the desired properties. 
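
The random coding idea can be tried out on a small scale. The following sketch is a
Monte Carlo illustration under assumptions made here, not the construction used in the
proof: the source is taken to be independent-letter, the distortion is per-letter error
probability, code words are drawn with independent letters from a single-letter
measure q, and each message is mapped to its nearest code word. All function names
and parameter values are hypothetical.

```python
import random


def hamming_per_letter(a, b):
    """Per-letter error-probability distortion between two equal-length words."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def random_code(n_words, t, q, rng):
    """Draw n_words code words of length t with independent letters of law q
    (q[j] plays the role of the single-letter Q measure)."""
    letters = list(range(len(q)))
    return [tuple(rng.choices(letters, weights=q, k=t)) for _ in range(n_words)]


def average_distortion(code, t, p, n_messages, rng):
    """Monte Carlo estimate of the average distortion when every message of an
    independent-letter source with law p is mapped to its nearest code word."""
    letters = list(range(len(p)))
    total = 0.0
    for _ in range(n_messages):
        message = tuple(rng.choices(letters, weights=p, k=t))
        total += min(hamming_per_letter(message, word) for word in code)
    return total / n_messages


if __name__ == "__main__":
    rng = random.Random(0)
    t, p, q = 12, [0.5, 0.5], [0.5, 0.5]
    for n_words in (1, 16, 256):
        code = random_code(n_words, t, q, rng)
        print(n_words, average_distortion(code, t, p, 300, rng))
```

With such short blocks the estimated distortion stays well above the R(D) bound; the
sketch only shows the trend that more code words give lower average distortion.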

The ensemble of codes is defined as follows. For the given D* there will be a set of
transition probabilities $q_i(j)$ that result in the minimum R, that is, R(D*). The set of
letter probabilities, together with these transition probabilities, induce a measure
Q(Z) in the space of reproduced words. The Q measure for a single Z letter, say letter
j, is $\sum_i P_i\, q_i(j)$. The Q measure for a Z word consisting of letters $j_1, j_2, \ldots, j_t$ is

$$Q(Z) = \prod_{k=1}^{t} \sum_i P_i\, q_i(j_k).$$

In the ensemble of codes of length t, the integers from 1 to N are mapped into Z
words of length t in all possible ways. An integer is mapped into a particular word $Z_1$,
say, with probability $Q(Z_1)$, and the probabilities for different integers are statistically
independent. This is exactly the same process as that of constructing a random code
ensemble for a memoryless channel, except that here the integers are mapped into the
Z space by using the Q(Z) measure. Thus we arrive at a set of codes (if there are f
letters in the Z alphabet there will be $f^{tN}$ different codes in the ensemble) and each
code will have an associated probability. The code in which integer i is mapped into
$Z_i$ has probability $\prod_{i=1}^{N} Q(Z_i)$.

We now use Lemma 1 to bound the average distortion for this ensemble of codes (using
the probabilities associated with the codes in calculating the average). Note, first,
that in the ensemble of codes if $Q(\beta)$ is the Q measure of a set $\beta$ of Z words, then the
probability that this set contains no code words is $[1 - Q(\beta)]^N$, that is, the product of
the probability that code word 1 is not in $\beta$, that for code word 2, etc. Hence the
probability that $\beta$ contains at least one code word is $1 - [1 - Q(\beta)]^N$. Now, referring to
Lemma 1, the average distortion may be bounded by

$$\bar{D} \le \epsilon D_{\max} + D_{\max}\,[1 - Q(\beta_M)]^N + (D^* + \epsilon).$$

Here $D_{\max}$ is the largest possible distortion between an m letter and a z letter. The
first term, $\epsilon D_{\max}$, arises from message words M which are not in the set $\alpha$. These
have total probability less than or equal to $\epsilon$ and, when they occur, average distortion
less than or equal to $D_{\max}$. The second term overbounds the contribution that is due
to cases in which the set $\beta_M$ for the message M does not contain at least one code
word. The probability of this in the ensemble is certainly bounded by $[1 - Q(\beta_M)]^N$,
and the distortion is necessarily bounded by $D_{\max}$. Finally, if the message is in $\alpha$ and
there is at least one code word in $\beta_M$, the distortion is bounded by $D^* + \epsilon$, according
to Lemma 1. Now, $Q(\beta_M) \ge e^{-t(R(D^*) + \epsilon)}$. Also, for $0 < x \le 1$,



$$(1 - x)^{1/x} = e^{\frac{1}{x} \log (1 - x)} \le e^{-1}$$

(using the alternating and monotonically decreasing nature of the terms of the
logarithmic expansion). Hence

$$[1 - Q(\beta_M)]^N \le \left[1 - e^{-t(R(D^*) + \epsilon)}\right]^N
= \left\{ \left[1 - e^{-t(R(D^*) + \epsilon)}\right]^{e^{t(R(D^*) + \epsilon)}} \right\}^{N e^{-t(R(D^*) + \epsilon)}}
\le \exp\left(-N e^{-t(R(D^*) + \epsilon)}\right).$$

If we choose for N, the number of points, the value $e^{t(R(D^*) + 2\epsilon)}$ (or, if this is not an
integer, the smallest integer exceeding this quantity), then the expression given above
is bounded by $\exp(-e^{t\epsilon})$. Thus the average distortion is bounded with this choice of N by

$$\bar{D} \le \epsilon D_{\max} + D_{\max} \exp(-e^{t\epsilon}) + D^* + \epsilon \le D^* + \delta,$$

provided that $\epsilon$ in Lemma 1 is chosen small enough to make $\epsilon (D_{\max} + 1) \le \delta/2$ and
then t is chosen large enough to make $D_{\max} \exp(-e^{t\epsilon}) \le \delta/2$. We also require that $\epsilon$ be
small enough and t large enough to make N, the integer just greater than or equal to
$e^{t(R(D^*) + 2\epsilon)}$, less than or equal to $e^{t(R(D^*) + \delta)}$. Since Lemma 1 holds for all
sufficiently large t and any positive $\epsilon$, these can all be simultaneously satisfied.
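
The chain of bounds on the probability that no code word falls in the favorable set can
be evaluated numerically. The sketch below is only a check of the inequalities under
the stated choice of N, with x standing for the lower bound on the Q measure of the
set; rates are in natural units, and the function name and sample values are
assumptions made for this illustration.

```python
import math


def miss_probability_bound(t, rate, eps):
    """Return ((1 - x)**N, exp(-N x), exp(-e^{t eps})) for
    x = e^{-t(rate + eps)} and N = ceil(e^{t(rate + 2 eps)})."""
    x = math.exp(-t * (rate + eps))
    n_words = math.ceil(math.exp(t * (rate + 2.0 * eps)))
    exact = math.exp(n_words * math.log1p(-x))   # (1 - x)**N, computed stably
    bound = math.exp(-n_words * x)               # exp(-N x)
    target = math.exp(-math.exp(t * eps))        # exp(-e^{t eps})
    return exact, bound, target


if __name__ == "__main__":
    for t in (10, 20, 40, 80):
        print(t, miss_probability_bound(t, rate=0.3, eps=0.05))
```

The third quantity decreases doubly exponentially in t, which is why the middle term of
the distortion bound can be made smaller than any fixed fraction of δ.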

We have shown, then, that the conditions of the theorem are satisfied by the average
distortion of the ensemble of codes. It follows that there exists at least one specific
code in the ensemble whose average distortion is bounded by $D^* + \delta$. This concludes
the proof.

Corollary. Theorem 2 remains true if $\delta$ is replaced by 0 in property (1). It also
remains true if the $\delta$ in property (1) is retained and the $\delta$ in property (2) is replaced
by 0, provided in this case that $D^* > D_{\min}$, the smallest D for which R(D) is defined.

This corollary asserts that we can attain (or do better than) one coordinate of the
R(D) curve and approximate, as closely as desired, the other, except possibly for the
$D_{\min}$ point. To prove the first statement of the corollary, note first that it is true for
$D^* \ge D_{\max}$, the value for which $R(D_{\max}) = 0$. Indeed, we may achieve the point
$D = D_{\max}$ with N = 1 and a code of length 1, using only the Z word consisting of the
single Z letter which gives this point of the curve. For $D_{\min} \le D^* < D_{\max}$, apply
Theorem 2 to approximate $D^{**} = D^* + \delta/2$. Since the curve is strictly decreasing, this
approximation will lead to codes with $\bar{D} \le D^* + \delta$ and $\frac{1}{t} \log N \le R(D^*)$, if the $\delta$ in
Theorem 2 is made sufficiently small.

The second simplification in the corollary is carried out in a similar fashion, by
choosing a D** slightly smaller than the desired D*, that is, a D** such that
$R(D^{**}) = R(D^*) + \delta/2$, and by using Theorem 2 to approximate this point of the curve.

Now suppose we have a memoryless channel of capacity C. By the coding theorem
for such channels it is possible to construct codes and decoding systems with rate
approximating C (per use of the channel) and error probability $\le \epsilon_1$ for any $\epsilon_1 > 0$.
We may combine such a code for a channel with a code of the type mentioned above for
a source at a given distortion level D* and obtain the following result.

Theorem 3. Given a source characterized by R(D) and a memoryless channel with
capacity C > 0, given $\epsilon > 0$ and $D^* > D_{\min}$, there exists, for sufficiently large t and
n, a block code that maps source words of length t into channel input words of length
n and a decoding system that maps channel output words of length n into reproduced
words of length t, satisfying

1) $\bar{D} \le D^*$,

2) $\frac{nC}{t\,R(D^*)} \le 1 + \epsilon$.

Thus we may find codes with average distortion level as good or better than any
desired D* (greater than $D_{\min}$) and at the same time approximate using the channel at a
rate corresponding to R(D*). This is done, as in the corollary stated above, by
approximating the R(D) curve slightly to the left of D*, say, at $R(D^* - \delta)$. Such a code
will have $N = e^{t(R(D^* - \delta) + \delta_1)}$ words, where $\delta_1$ can be made small by taking t
large. Next, a code for the channel is constructed with N words and of length n, the 
largest integer satisfying _ < R(D* - 6 ) + 6j. By choosing t sufficiently large, 
this will approach zero error probability, since it corresponds to a rate less than 
channel capacity. If these two codes are combined, it produces an over-all code with 
average distortion < D*. 
Numerical Results for Some Simple Channels 

In this section some numerical results will be given for certain simple channels 
and sources. Consider, first, a binary independent letter source with equiprobable 
letters and suppose that the distortion measure is the error probability (per letter). 
This falls into the class for which a simple explicit solution was given. The R(D) 
curve, in fact, is 

$$R(D) = 1 + D \log_2 D + (1 - D) \log_2 (1 - D) \quad \text{(bits per source letter)}.$$

This, of course, is the same formula as that for the capacity of a symmetric binary
channel with probabilities D and (1 - D), the reason being that these constitute the
probability assignment $q_i(j)$ which solves the minimizing problem.

This R(D) curve is shown in Fig. 5. Our coding results interpreted here say, for
example, that to transmit this source with error probability 0.1 requires
a channel capacity of at least 0.53 bits per source digit transmitted, and if the channel
capacity available is this large, codes can be constructed which will approach this error
rate. If an error probability of 0.3 can be tolerated, a capacity of only about 0.1 bit is
necessary and sufficient. If 0.5 error probability can be tolerated, of course no channel
capacity is required. Indeed, one might write down at the receiving point a series of
zeroes.
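
The figures quoted above can be reproduced directly from the formula. The sketch
below evaluates R(D) for the equiprobable binary source; the function names are
chosen here, and setting R(D) to zero for D ≥ 1/2 is an assumption taken from the
remark that no channel capacity is required at error probability 0.5.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rate_distortion_binary(d):
    """R(D) = 1 + D log2 D + (1 - D) log2 (1 - D), taken as 0 for D >= 1/2."""
    if d >= 0.5:
        return 0.0
    return 1.0 - binary_entropy(d)


if __name__ == "__main__":
    for d in (0.0, 0.1, 0.3, 0.5):
        print(d, round(rate_distortion_binary(d), 4))
```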

Also plotted in Fig. 5 are a number of points corresponding to specific simple codes
where we have assumed a noiseless binary channel is available. A particular code is
represented by a point whose abscissa is the error probability for the code and whose
ordinate is the channel capacity used per message letter. All points must therefore
lie on or above the R(D) curve and their distance above is a measure of how closely
they approximate ideal encoding.

One point, D = 0, is obtained with capacity 1 bit per message letter simply by sending
the binary digits through the channel. Other simple codes which encode 3, 5, 7, and
9 message letters into one channel letter are the following. For the ratio 5, for example,
encode message sequences of five digits into 0 or 1 accordingly as the sequence
contains more than half zeros or more than half ones.

At the receiving point, a 0 is decoded into a sequence of zeros of the appropriate
length and a 1 into a sequence of ones. These rather degenerate codes are plotted in
Fig. 5 with crosses. Simple though they are, with block length of the channel sequences
only one, they still approximate to some extent the lower bound.
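
The points for these majority codes can be computed exactly. In the sketch below
(names chosen here for illustration), a block of n equiprobable binary digits with k
ones is reproduced as all zeros or all ones, so min(k, n - k) of its n digits are in
error; the capacity used is 1/n bit per message letter.

```python
import math


def rate_distortion_binary(d):
    """R(D) for the equiprobable binary source, in bits per source letter."""
    if d <= 0.0:
        return 1.0
    if d >= 0.5:
        return 0.0
    return 1.0 + d * math.log2(d) + (1.0 - d) * math.log2(1.0 - d)


def majority_code_point(n):
    """(error probability, channel letters per message letter) for the code
    that sends one majority digit per block of n message digits, n odd."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a majority always exists")
    expected_errors = sum(math.comb(n, k) * min(k, n - k) for k in range(n + 1)) / 2 ** n
    return expected_errors / n, 1.0 / n


if __name__ == "__main__":
    for n in (3, 5, 7, 9):
        d, r = majority_code_point(n)
        print(n, d, r, rate_distortion_binary(d))
```

Each computed ordinate 1/n lies above the R(D) value at the same error probability, as
every code point must.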

Plotted on the same curve are solid points corresponding to the well-known single-
error correcting codes [2] with block lengths 3, 7, 15, and 31. These codes are used
backwards here: any message in the 15-dimensional cube, for example, is transmitted
over the channel as what would ordinarily be the eleven message digits of its nearest
code point. At the receiving point, the corresponding fifteen-digit message is
reconstructed. This can differ at most in one place from the original message. Thus
for this case the ratio of channel to message letters is 11/15, and the error probability
is easily found to be 1/16. This series of points gives an approximation to the lower
bound for lower values of D.
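
The values 11/15 and 1/16 follow from the fact that the spheres of radius one about the
code points fill the cube exactly: of the n + 1 points in a sphere, one is the code point
itself and the other n differ from it in one of n places. The sketch below states this
count and confirms it by brute force for block length 7; the construction of the code
from position-number syndromes is one standard description of a single-error
correcting code, and all names are chosen here for illustration.

```python
from functools import reduce
from itertools import product


def hamming_backwards_point(r):
    """(error probability, channel letters per message letter) for the
    single-error correcting code of length n = 2**r - 1 used backwards."""
    n = 2 ** r - 1
    return 1.0 / (n + 1), (n - r) / n


def hamming_code(r):
    """All binary words of length 2**r - 1 whose syndrome (XOR of the 1-based
    positions holding a one) is zero."""
    n = 2 ** r - 1
    words = []
    for x in product((0, 1), repeat=n):
        syndrome = reduce(lambda a, b: a ^ b, (i + 1 for i, bit in enumerate(x) if bit), 0)
        if syndrome == 0:
            words.append(x)
    return words


def brute_force_distortion(r):
    """Average per-letter distance from a uniformly chosen word to its nearest code word."""
    n = 2 ** r - 1
    code = hamming_code(r)
    total = 0
    for x in product((0, 1), repeat=n):
        total += min(sum(a != b for a, b in zip(x, c)) for c in code)
    return total / (n * 2 ** n)


if __name__ == "__main__":
    print(hamming_backwards_point(4))      # block length 15
    print(brute_force_distortion(3))       # block length 7, checked exhaustively
```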

It is possible to fill in densely between points of these discrete series by a technique
of mixing codes. For example, one may alternate in using two codes. More generally,
one may mix them in proportions $\lambda$ and $1 - \lambda$, where $\lambda$ is any rational fraction.
Such a mixture gives a code with a new ratio R of channel to message letters, given
by $R = \lambda R_1 + (1 - \lambda) R_2$, where $R_1$ and $R_2$ are the ratios for the given codes, and with
new error probability $P_e = \lambda P_{e1} + (1 - \lambda) P_{e2}$. This gives a linear interpolation
between any two code points. For example, when applied to two of the simple codes in
Fig. 5, it produced the series of points indicated by open circles. This mixing technique
may be applied in any case of a single-letter distortion measure to give a linear
interpolation between two known codes. It may also be used to give an alternative
proof of the convexity of the R(D) curve.
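
The interpolation is simple enough to state as a few lines of code. In this sketch
(the function name and the example pair of codes are chosen here), each code is given
as a pair (ratio of channel to message letters, error probability) and the mixing
fraction is interpreted as the fraction of message letters handled by the first code.

```python
from fractions import Fraction


def mix_codes(point_a, point_b, lam):
    """Return the (ratio, error probability) of a mixture that uses code a for
    a fraction lam of the message letters and code b for the rest."""
    (r1, p1), (r2, p2) = point_a, point_b
    return lam * r1 + (1 - lam) * r2, lam * p1 + (1 - lam) * p2


if __name__ == "__main__":
    straight = (Fraction(1), Fraction(0))           # send every digit unchanged
    majority3 = (Fraction(1, 3), Fraction(1, 4))    # one digit per block of three
    for lam in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
        print(lam, mix_codes(straight, majority3, lam))
```

Using exact fractions keeps the mixing proportion rational, as the text requires.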

Another channel was also studied with regard to actual codes, namely, the binary
symmetric channel of capacity $C = \frac{1}{2}$. This has probabilities 0.89 that a digit is received
correctly and 0.11 incorrectly. Here the series of points (Fig. 6) for simple
codes actually touches the lower bound at the point $R = \frac{1}{2}$. This is because the channel
itself, without coding, produces just this error probability. Any symmetric binary
channel will have one point that can be attained exactly by means of straight transmission,
when error probability is the distortion measure.
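
That a crossover probability of 0.11 corresponds to a capacity of very nearly one half
bit can be confirmed from the standard capacity formula for a binary symmetric
channel, C = 1 - H(p). The short sketch below does this; the function names are
chosen here for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(p):
    """Capacity in bits per use of a binary symmetric channel with crossover p."""
    return 1.0 - binary_entropy(p)


if __name__ == "__main__":
    print(bsc_capacity(0.11))
```

Since R(D) for the equiprobable binary source is also 1 - H(D), straight transmission
over this channel gives a point lying exactly on the curve at D = 0.11.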

Figure 7 shows the R(D) curve for another simple situation, a binary independent 
letter source but with the reproduced Z alphabet consisting of three letters, 0, 1, and 
?. The distortion measure is zero for a correct digit, one for an incorrect digit, and 
0.25 for ?. In the same figure is shown, for comparison, the R(D) curve without the 
? option. 
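
Curves of this kind can be computed numerically. The sketch below uses the
Blahut-Arimoto iteration, a later numerical method that is not the procedure of this
paper and is named here only as one convenient way to trace such a curve. For a slope
parameter beta > 0 it returns one point (D, R) of the curve. The function name, the
iteration count, and the sample values of beta are assumptions of this illustration.

```python
import math


def blahut_arimoto(p, d, beta, n_iter=2000):
    """Return (D, R in bits) for source letter probabilities p[i], distortion
    matrix d[i][j], and slope parameter beta > 0 (fixed iteration count)."""
    n_in, n_out = len(p), len(d[0])
    w = [[math.exp(-beta * d[i][j]) for j in range(n_out)] for i in range(n_in)]
    q = [1.0 / n_out] * n_out                     # reproduced-letter probabilities

    def normalizers(q_now):
        return [sum(q_now[j] * w[i][j] for j in range(n_out)) for i in range(n_in)]

    for _ in range(n_iter):
        s = normalizers(q)
        # new output law = marginal of the transition probabilities q[j] w[i][j] / s[i]
        q = [q[j] * sum(p[i] * w[i][j] / s[i] for i in range(n_in)) for j in range(n_out)]

    s = normalizers(q)
    distortion = rate = 0.0
    for i in range(n_in):
        for j in range(n_out):
            cond = q[j] * w[i][j] / s[i]          # transition probability of letter i into j
            distortion += p[i] * cond * d[i][j]
            if cond > 0.0:
                rate += p[i] * cond * math.log2(w[i][j] / s[i])
    return distortion, rate


if __name__ == "__main__":
    source = [0.5, 0.5]
    plain = [[0.0, 1.0], [1.0, 0.0]]                  # reproduced letters 0, 1
    with_query = [[0.0, 1.0, 0.25], [1.0, 0.0, 0.25]]  # reproduced letters 0, 1, ?
    for beta in (2.0, 3.0, 5.0):
        print(beta, blahut_arimoto(source, plain, beta), blahut_arimoto(source, with_query, beta))
```

In this sketch the "?" letter is assigned distortion 0.25 against either source digit,
as in the text. Without the "?" letter the computed points satisfy R = 1 - H(D).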

Figure 8 shows the R(D) curves for independent letter sources with various numbers 
of equiprobable letters in the alphabet (2, 3, 4, 5, 10, 100). Here again the distortion 
measure is taken to be error probability (for a reproduced digit). With b letters in 
the alphabet the R(D, b) curve is given by 

$$R(D, b) = \begin{cases}
\log_2 b + D \log_2 \dfrac{D}{b - 1} + (1 - D) \log_2 (1 - D), & D \le \dfrac{b - 1}{b}, \\[2ex]
0, & D > \dfrac{b - 1}{b}.
\end{cases}$$
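
A direct evaluation of this formula is sketched below for the alphabet sizes mentioned
in the text. The function name is chosen here; the value at D = 0 is taken as the
limit $\log_2 b$.

```python
import math


def rate_distortion_uniform(d, b):
    """R(D, b) in bits for b equiprobable letters with error probability as distortion."""
    if d >= (b - 1) / b:
        return 0.0
    if d <= 0.0:
        return math.log2(b)
    return math.log2(b) + d * math.log2(d / (b - 1)) + (1.0 - d) * math.log2(1.0 - d)


if __name__ == "__main__":
    for b in (2, 3, 4, 5, 10, 100):
        print(b, [round(rate_distortion_uniform(d, b), 3) for d in (0.0, 0.1, 0.3, 0.5)])
```

For b = 2 this reduces to the binary formula given earlier.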

Generalization to Continuous Cases 

We will now sketch briefly a generalization of the single-letter distortion measure 
to cases where the input and output alphabets are not restricted to finite sets but vary 
over arbitrary spaces. 

Assume a message alphabet A = {m} and a reproduced letter alphabet B = {z}.
For each pair (m, z) in these alphabets let d(m, z) be a non-negative number, the dis- 
tortion if m is reproduced as z. Further, we assume a probability measure P defined 
over a Borel field of subsets of the A space. Finally, we require that, for each z be- 
longing to B, d(m,z) is a measurable function with finite expectation. 

Consider a finite selection of points $z_i$ (i = 1, 2, ..., l) from the B space, and a
measurable assignment of transition probabilities $q(z_i \mid m)$. (That is, for each
i, $q(z_i \mid m)$ is a measurable function in the A space.) For such a choice of $z_i$ and
assignment $q(z_i \mid m)$, a mutual information and an average distortion are determined:

$$R = \sum_i \int_A q(z_i \mid m) \log \frac{q(z_i \mid m)}{\int_A q(z_i \mid m')\, dP(m')}\, dP(m),$$

$$D = \sum_i \int_A d(m, z_i)\, q(z_i \mid m)\, dP(m).$$

We define the rate distortion function R(D*) for such a case as the greatest lower
bound of R when the set of points $z_i$ is varied (both in choice and number) and the
$q(z_i \mid m)$ is varied over measurable transition probabilities, subject to keeping the
distortion at the level D* or less.

Most of the results we have found for the finite alphabet case carry through easily
under this generalization. In particular, the convexity property of the R(D) curve still
holds. In fact, if R(D) can be approximated to within $\epsilon$ by a choice $z_i$ and $q(z_i \mid m)$ and
R(D') by a choice $z'_i$ and $q'(z'_i \mid m)$, then one considers the choice $z''_i$ consisting of the
union of the points $z_i$ and $z'_i$, together with
$q''(z''_i \mid m) = \frac{1}{2} [q(z''_i \mid m) + q'(z''_i \mid m)]$ (using
zero if $q(z''_i \mid m)$ or $q'(z''_i \mid m)$ is undefined). This leads, by the convexity of R and by
the linearity of D, to an assignment for $D'' = \frac{1}{2} D + \frac{1}{2} D'$, giving an R'' within $\epsilon$ of the
midpoint of the line joining (D, R(D)) and (D', R(D')). It follows, since $\epsilon$ can be made
arbitrarily small, that the greatest lower bound of R(D'') is on or below this midpoint.

In the general case it is, however, not necessarily true that the R(D) curve
approaches a finite end-point when D decreases toward its minimum possible value. The
behavior may be as indicated in Fig. 9 with R(D) going to infinity as D goes to $D_{\min}$.
On the other hand, under the conditions we have stated, there is a finite $D_{\max}$ for
which $R(D_{\max}) = 0$. This value of D is given by

$$D_{\max} = \operatorname*{g.l.b.}_{z \in B} E\,[d(m, z)].$$

The negative part of the coding theorem goes through in a manner essentially the 
same as the finite alphabet case, it being assumed that the only allowed coding func- 
tions from the source sequences to channel inputs correspond to measurable subsets 
of the source space. (If this assumption were not made, the average distortion would 
not, in general, even be defined. ) The various inequalities may be followed through, 
changing the appropriate sums in the A space to integrals and resulting in the corres- 
ponding negative theorem. 

For the positive coding theorem, also, substantially the same argument may be
used with an additional $\epsilon$ involved to account for the approximation to the greatest
lower bound of R(D) with a finite selection of $z_i$ points. Thus, one chooses a set of $z_i$
to approximate, within $\epsilon$, the R(D) curve and then proceeds with the random coding
method. The only point to be noted is that the $D_{\max}$ term must now be handled in a
slightly different fashion. To each code in the ensemble one may add a particular
point, say $z_0$, and replace $D_{\max}$ by $E(d(m, z_0))$, a finite quantity. The results of the
theorem then follow.
Difference Distortion Measure 

A special class of distortion measures for certain continuous cases of some impor- 
tance and for which more explicit results can be obtained will now be considered. For 
these the m and z spaces are both the sets of all real numbers. The distortion meas- 
ure d(m, z) will be called a difference distortion measure if it is a function only of the 
difference m - z, thus $d(m, z) = e(m - z)$. A common example is the squared error
measure, $d(m, z) = (m - z)^2$ or, again, the absolute error criterion $d(m, z) = |m - z|$.

We will develop a lower bound on R(D) for a difference distortion measure. First
we define a function $\phi(D)$ for a given difference measure e(u) as follows. Consider an
arbitrary distribution function G(u) and let H be its entropy and D the average distortion
between a random variable with the given distribution and zero. Thus

$$H = -\int_{-\infty}^{\infty} p(u) \log p(u)\, du, \qquad D = \int_{-\infty}^{\infty} e(u)\, p(u)\, du,$$

where p(u) is the density of G(u). We wish to vary the distribution G(u), keeping
$D \le D^*$, and seek the maximum H. The least upper bound, if finite, is clearly actually
attained as a maximum for some distribution. This maximum H for a given D* we call
$\phi(D^*)$, and a corresponding distribution function is called a maximizing distribution
for this D*.

Now suppose we have a distribution function for the m space (generalized letter
probabilities) P(m), with entropy H(m). We wish to show that

$$R(D) \ge H(m) - \phi(D).$$

Let $z_i$ be a set of z points and $q(z_i \mid m)$ an assignment of transition probabilities. Then
the mutual information between m and z may be written

$$R = H(m) - \sum_i Q_i\, H(m \mid z_i),$$

where $Q_i$ is the resulting probability of $z_i$. If we let $D_i$ be the average distortion
between m and $z_i$, then

$$H(m \mid z_i) \le \phi(D_i).$$

This is because $\phi(D)$ was the maximum H for a given average distortion and also
because the distortion is a function only of the difference between m and z, so that this
maximizing value applies for any $z_i$. Thus

$$R \ge H(m) - \sum_i Q_i\, \phi(D_i).$$

Now $\phi(D)$ is a concave function. This is a consequence of the concavity of entropy
considered as a function of a distribution function and the linearity of D in the same
space of distribution functions, by an argument identical with that used previously.

Hence, $\sum_i Q_i\, \phi(D_i) \le \phi\left(\sum_i Q_i D_i\right) = \phi(D)$, where D is the average distortion with the
choice $z_i$ and the assigned transition probabilities. It follows that

$$R \ge H(m) - \phi(D).$$

Since this is true for any assignment $z_i$ and $q(z_i \mid m)$ the desired result is proved.

If, for a particular P(m) and e(u), assignments can be made which approach
arbitrarily close to this lower bound, then, of course, this is the R(D) function. Such is
the case, for example, if P(m) is gaussian and $e(u) = u^2$ (mean square error measure
of distortion). Suppose that the message has variance $\sigma^2$, and consider a gaussian
distribution of mean zero and variance $\sigma^2 - D$ in the z space. (If this is zero or
negative, clearly R(D) = 0 by using only the z point zero.) Let the conditional probabilities
q(m|z) be gaussian with variance D. This is consistent with the gaussian character
of P(m), since normal distributions convolve to give normal distributions with the sum
of the individual variances. These assignments determine the conditional probability
measure q(z|m), also then normal.

A simple calculation shows that this assignment attains the lower bound given above.
The resulting R(D) curve is

$$R(D) = \begin{cases}
\log \dfrac{\sigma}{\sqrt{D}}, & D \le \sigma^2, \\[2ex]
0, & D > \sigma^2.
\end{cases}$$

This is shown for $\sigma^2 = 1$ in Fig. 9.
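
The agreement between the gaussian curve and the lower bound H(m) - φ(D) can be
checked numerically. The sketch below relies on the standard fact that, for the
squared error measure, the entropy subject to a mean square value of at most D is
maximized by a gaussian of variance D, so that φ(D) = ½ log(2πeD). Logarithms are
taken to base 2, and the function names are chosen here for illustration.

```python
import math


def gaussian_entropy_bits(variance):
    """Differential entropy, in bits, of a gaussian with the given variance."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)


def phi_squared_error(d):
    """Maximum entropy (bits) of a density with mean square value at most d."""
    return gaussian_entropy_bits(d)


def rate_distortion_gaussian(variance, d):
    """R(D) = log(sigma / sqrt(D)) for D <= sigma^2, and 0 otherwise (bits)."""
    if d >= variance:
        return 0.0
    return 0.5 * math.log2(variance / d)


if __name__ == "__main__":
    for d in (0.05, 0.25, 0.5, 1.0, 2.0):
        bound = gaussian_entropy_bits(1.0) - phi_squared_error(d)
        print(d, rate_distortion_gaussian(1.0, d), max(bound, 0.0))
```

For D below the variance the two printed columns coincide, and the rate grows without
limit as D approaches zero.
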
Definition of a Local Distortion Measure 

Thus far we have considered only a distortion measure d (or d(m,z)) which depends 
upon comparison of a message letter with the corresponding reproduced letter, this 
letter-to-letter distortion to be averaged over the length of message and over the set 
of possible messages and possible reproduced messages. In many practical cases, 
however, this type of measure is not sufficiently general. The seriousness of a parti- 
cular type of error often depends on the context. 

Thus in transmitting a stock market quotation, say: "A. T.& T. 5900 shares, closing 
194, " an error in the 9 of 5900 shares would normally be much less serious than an 
error in the 9 of the closing price. 

We shall now consider a distortion measure that depends upon local context and, in 
fact, compares blocks of g message letters with the corresponding blocks of g letters 
of the reproduced message. 

A local distortion measure of span g is a function $d(m_1, m_2, \ldots, m_g; z_1, z_2, \ldots, z_g)$
of message sequences of length g and reproduced message sequences of length g (from
a possibly different or larger alphabet) with the property that $d \ge 0$. The distortion
between $M = m_1, m_2, \ldots, m_t$ and $Z = z_1, z_2, \ldots, z_t$ ($t \ge g$) is defined by

$$D(M, Z) = \frac{1}{t} \sum_{k=1}^{t-g+1} d(m_k, m_{k+1}, \ldots, m_{k+g-1};\ z_k, z_{k+1}, \ldots, z_{k+g-1}).$$

The distortion of a block code in which message M and reproduced version Z occur
with probability P(M, Z) is defined by

$$\bar{d} = \sum_{M, Z} P(M, Z)\, D(M, Z).$$






In other words, we assume, with a local distortion measure, that the evaluation of an 
entire system is obtained by averaging the distortions for all block comparisons of 
length g each with its probability of occurrence a weighting factor. 
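
A small sketch may make the definition concrete. It assumes that the block distortion
sums the span-g comparisons over all t - g + 1 window positions and divides by the
block length t; that normalization, the function names, and the toy context-dependent
measure (an error costs ten times as much when the preceding message letter is "$")
are all assumptions made for this illustration.

```python
def local_distortion(message, reproduced, d, g):
    """Block distortion under a local measure d of span g: the sum of d over
    all t - g + 1 aligned windows, divided by the block length t."""
    t = len(message)
    if len(reproduced) != t or t < g:
        raise ValueError("blocks must have equal length t >= g")
    total = sum(d(message[k:k + g], reproduced[k:k + g]) for k in range(t - g + 1))
    return total / t


def context_weighted(m_block, z_block):
    """Toy span-2 measure: an error in the second letter costs 10 if the first
    message letter is '$' and 1 otherwise; no error costs 0."""
    if m_block[1] == z_block[1]:
        return 0.0
    return 10.0 if m_block[0] == "$" else 1.0


if __name__ == "__main__":
    print(local_distortion("a9$9", "a8$9", context_weighted, 2))  # error in the cheap context
    print(local_distortion("a9$9", "a9$8", context_weighted, 2))  # error in the costly context
```
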
The Functions R_n(D) and R(D) for a Local Distortion Measure and Ergodic Source

Assume that we have given an ergodic message source and a local distortion mea- 
sure. Consider blocks of n message letters with their associated probabilities (as de- 
termined by the source) together with possible blocks Z of reproduced message of 
length n. Let an arbitrary assignment of transition probabilities from the M blocks
to the Z blocks, q(Z|M), be made. For this assignment we can calculate two quantities:
1) the average mutual information per letter

$$R = \frac{1}{n} E\left[ \log \frac{q(Z \mid M)}{Q(Z)} \right]$$

and 2) the average distortion if the M's were reproduced as Z's with the probabilities
q(Z|M). This is

$$D = \sum_{M, Z} P(M, Z)\, D(M, Z).$$

By variation of q(Z|M), while holding $D \le D^*$, we can, in principle, find the minimum
R for each D*. This we call $R_n(D^*)$.

The minimizing problem here is identical with that discussed previously if we think 
of M and Z as individual letters in a (large) alphabet, and various results relating to 
this minimum can be applied. In particular, $R_n(D)$ is a convex downward function.

We now define the rate distortion function for the given source relative to the dis- 
tortion measure as 

$$R(D) = \liminf_{n \to \infty} R_n(D).$$

It can be shown, by a direct but tedious argument which we shall omit, that the "inf"
may be deleted from this definition. In other words, $R_n(D)$ approaches a limit as
$n \to \infty$.

We are now in a position to prove coding theorems for a general ergodic source with 
a local distortion measure. 
The Positive Coding Theorem for a Local Distortion Measure 

Theorem 4. Suppose that we are given an ergodic source and a local distortion measure
of span g with rate distortion function R(D). Let K be a memoryless discrete
channel with capacity C, let D* be a value of distortion, and let $\epsilon$ be a positive number.
Then there exists a block code with distortion less than or equal to $D^* + \epsilon$, and
with a signaling rate at least $\left( \frac{C}{R(D^*)} - \epsilon \right)$ message letters per channel letter.

Proof. Choose an $n_1$ so that $R_{n_1}(D^*) - R(D^*) < \epsilon/3$ and, also, so large that
$\frac{g}{n_1} D_{\max} < \epsilon/3$. Now consider blocks of length $n_1$ as "letters" of an enlarged alphabet.
Using Theorem 3 we can construct a block code using sufficiently long sequences of
these "letters" signaling at a rate close to (say within $\epsilon/3$ of) $R_{n_1}(D^*)/C$ (in terms of
original message letters) and with distortion less than $D^* + \epsilon/3$. It must be remembered
that this distortion is based on a single "letter" comparison. However, the distortion
by the given local distortion measure will differ from this only because of overlap
comparisons (g for each $n_1$ letters of message) and hence the discrepancy is, at
most, $\frac{g}{n_1} D_{\max} < \epsilon/3$. It follows that this code signals at a rate within $\epsilon$ of R(D*)
and at a distortion within $\epsilon$ of D*.
The Converse Coding Theorem 

Theorem 5. Suppose that we are given an ergodic source and a local distortion
measure with rate distortion function R(D). Let K be a memoryless discrete channel
with capacity C, let D* be a value of distortion, and let $\epsilon$ be a positive number. Then
there exists $t_0$ which is such that any code transmitting $t \ge t_0$ message letters with n
uses of the channel at distortion D*, or less, satisfies

$$\frac{n}{t}\, C \ge R(D^*) - \epsilon.$$

That is, the channel capacity bits used per message letter must be nearly R(D*) for
long transmissions.

Proof. Choose $t_0$ so that for $t \ge t_0$ we have $R_t(D^*) \ge R(D^*) - \epsilon$. Since R(D) was
defined as $\liminf R_t(D)$, this is possible. Suppose that we have a code for such a $t \ge t_0$
which maps sequences M consisting of t message letters into sequences X of n channel
letters and decodes sequences Y of n channel output letters into sequences Z of
reproduced messages. The channel will have, from its transition probabilities, some
P(Y|X). Furthermore, from the encoding and decoding functions, we shall have
X = f(M) and Z = g(Y). Finally, there will be, from the source, probabilities for the
message sequences P(M). Due to the encoding function f(M) this will induce a set of
probabilities P(X) for input sequences. If the channel capacity is C, the average mutual
information R(X, Y) between input and output sequences must satisfy

$$R(X, Y) = E\left[ \log \frac{P(X \mid Y)}{P(X)} \right] \le nC,$$

since nC is the maximum possible value of this quantity when P(X) is varied. Also,
since X is a function of M and Z is a function of Y, we have

$$R(M, Z) = E\left[ \log \frac{P(M \mid Z)}{P(M)} \right] \le R(X, Y) \le nC.$$

The coding system in question amounts, over-all, to a set of conditional probabilities
from M sequences to Z sequences as determined by the two coding functions and the
transition probabilities. If the distortion of the over-all system is less than or equal
to D*, then $t R_t(D^*) = \min_{P(Z \mid M)} R(M, Z)$ is certainly less than or equal to the particular
R(M, Z) obtained with the probabilities given by the channel and coding system.
($R_t(D^*)$ is multiplied by t because $R_t(D)$ is measured on a per message letter basis,
while the R(M, Z) quantities are for sequences of length t.) Thus

$$t R_t(D^*) \le R(M, Z) \le nC,$$

$$t\,(R(D^*) - \epsilon) \le nC,$$

$$\frac{n}{t}\, C \ge R(D^*) - \epsilon.$$
This is the conclusion of the theorem. 
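
The inequality of the theorem gives a floor on the number of channel uses per message
letter. As a numerical illustration (a special case chosen here, not taken from the
proof), the sketch below evaluates that floor for the equiprobable binary source with
error probability as distortion, sent over a binary symmetric channel; the function
names and sample values are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def channel_uses_lower_bound(d_star, crossover, eps=0.0):
    """Lower bound (R(D*) - eps) / C on n / t for the equiprobable binary
    source and a binary symmetric channel with the given crossover probability."""
    rate = 1.0 - binary_entropy(d_star) if d_star < 0.5 else 0.0
    capacity = 1.0 - binary_entropy(crossover)
    return max(rate - eps, 0.0) / capacity


if __name__ == "__main__":
    for d_star in (0.05, 0.1, 0.11, 0.3):
        print(d_star, channel_uses_lower_bound(d_star, 0.11))
```

At D* equal to the crossover probability the bound is exactly one channel use per
message letter, the straight-transmission case.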

Notice from the method of proof that the code used again need not be a block code, 
provided only that after n uses of the channel t recovered letters are written down. If 
one has some kind of variable-length code and, starting at time zero, uses this code
continually, the inequality of the theorem will hold for any finite time after $t_0$ message
letters have been recovered; and of course as longer and longer blocks are compared,
$\epsilon \to 0$. It is even possible to generalize this to variable-length codes in which,
after n uses of the channel, the number of recovered message letters is a random variable
depending, perhaps, on the particular message and the particular chance operation
of the channel. If, as is usually the case in such codes, there exists an average
signaling rate with the properties that after n uses of the channel then, with probability
nearly one, t letters will be written down, with t lying between $t_1(1 - \delta)$ and $t_1(1 + \delta)$
(the $\delta \to 0$ as $n \to \infty$), then essentially the same theorem applies, using the mean $t_1$
for t.
Channels with Memory 

Finally, we mention that while we have, in the above discussion, assumed the
channel to be memoryless, very similar results, both of positive and negative type, can
be obtained for channels with memory.

For a channel with memory one may define a capacity C_n for the first n uses of the
channel starting at state s_0. This C_n is 1/n times the maximum average mutual
information between input sequences of length n and resulting output sequences when the
probabilities assigned the input sequences of length n are varied. The lower bound on
distortion after n uses of the channel is that given by Theorem 1 using C_n for C.

We can also define the capacity C for such a channel as C = lim sup C_n (n → ∞). The
positive parts of the theorem then state that one can find arbitrarily long block codes
satisfying Theorem 3. In most channels of interest, of course, historical influences die
out in such a way as to make C_n → C as n → ∞. For memoryless channels, C_n = C
for all n.
Duality of a Source and a Channel 

There is a curious and provocative duality between the properties of a source with a 
distortion measure and those of a channel. This duality is enhanced if we consider 
channels in which there is a "cost" associated with the different input letters, and it 
is desired to find the capacity subject to the constraint that the expected cost not
exceed a certain quantity. Thus input letter i might have cost v_i and we wish to find the
capacity with the side condition Σ_i P_i v_i ≤ v, say, where P_i is the probability of
using input letter i. This problem amounts, mathematically, to maximizing a mutual
information under variation of the P_i with a linear inequality as constraint. The solution
of this problem leads to a capacity cost function C(v) for the channel. It can be shown
readily that this function is concave downward. Solving this problem corresponds, in
a sense, to finding a source that is just right for the channel and the desired cost.

In a somewhat dual way, evaluating the rate distortion function R(D) for a source
amounts, mathematically, to minimizing a mutual information under variation of the
q_i(j), again with a linear inequality as constraint. The solution leads to a function
R(D) which is convex downward. Solving this problem corresponds to finding a channel 
that is just right for the source and allowed distortion level. This duality can be pur- 
sued further and is related to a duality between past and future and the notions of con- 
trol and knowledge. Thus, we may have knowledge of the past but cannot control it; 
we may control the future but have no knowledge of it. 
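
The two extremal problems can be set side by side in symbols. What follows is only a restatement, under assumed notation: I(·;·) abbreviates the average mutual information, d_{ij} stands for the distortion between source letter i and reproduced letter j, and in the second problem the P_i are the fixed source letter probabilities rather than channel input probabilities.

```latex
C(v) \;=\; \max_{\{P_i\} \,:\, \sum_i P_i v_i \,\le\, v} I(\text{input};\, \text{output}),
\qquad
R(D) \;=\; \min_{\{q_i(j)\} \,:\, \sum_{i,j} P_i\, q_i(j)\, d_{ij} \,\le\, D} I(\text{source};\, \text{reproduction}).
```

In the first problem the channel is fixed and the source statistics are varied; in the second the source is fixed and the transition probabilities are varied.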
GROUP TESTING TO CLASSIFY EFFICIENTLY ALL UNITS IN A BINOMIAL SAMPLE

Milton Sobel 

1. Summary. 

A number N of units are to be classified as good or defective by means of
"group-tests." A "group-test" is a simultaneous test on x units with only two possible
outcomes: "success" indicating that all x units are good and "failure" indicating that at
least one of the x units is defective (we don't know how many or which ones). The
problem is to find a simple and efficient procedure or to find the most efficient procedure
for classifying each of the N units as good or defective. For finite N, efficiency is
defined in the sense of minimizing the expected number of group-tests required; for
infinite N, efficiency is defined in the sense of maximizing the expected number of units
classified per test in the long run as the number of tests increases.

At the outset any set of units is assumed to be a random sample of independent
observations from a binomial distribution with a common known a priori probability q of
a unit being good and p = 1 - q of being defective.

A simple procedure (or decision rule) R_1, which describes a mode of action for any
given value of q, is proposed and compared with other procedures for the same
problem. Explicit instructions for carrying out R_1 are given in [4] for values of N from
1 to 16 for all q and for values of N from 17 to 100 for q = .90, .95 and .99. Section
14 gives for any N and any q an alternative way of carrying out R_1 which does not
require the computation of special tables. Exact formulas and numerical results for the
expected number of group-tests required are given in [4] and some of the latter are
repeated in Table II below; two lower bounds are described in Section 12 with
numerical values given in Table II.

Technical applications, other than the known application to blood testing, and some
conjectures on optimality are given in Section 2.

Several different generalizations of the problem are mentioned in Section 11; detailed
formulas are given in [4].

To show that R_1 is not generally optimal in the finite case, a modification R_0 of R_1
is defined in Section 13 and its improvement over R_1 is shown in Table II.

Another procedure R_2, which is simpler to compute and is related to R_1 in several
ways, is defined in Section 15 in terms of information theory concepts. A "halving"
procedure R_4 is also defined for purposes of comparison; numerical comparisons are
given in [4] for the finite case and in Table I below for the infinite case.

A good deal of this paper is a restatement without proofs of results proved in [4]
and of conjectures first stated in [4]. Much of Sections 6, 12, 13, and 14 is new
material.
2. Introduction. 

A problem which has hitherto been considered only in connection with blood-testing
applications [1], [2], [5] is shown to have industrial applications, and these have
focused interest on a more general treatment of the problem. During World War II a
great saving was accomplished in the field of blood testing by pooling a fixed number
of blood samples and testing the pooled sample for some particular disease. If the
disease was not present, then several people were passed by a single test; if the disease
was present, then there was enough blood remaining in each blood sample to test each
one separately. The amount of time, money, and effort saved by such a procedure
depends on how rare this disease is in the population of people being tested. In this
application the total number of people to be tested was regarded as unknown and very
large.
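
The saving from this fixed-size pooling scheme can be put in numbers with a short calculation. The sketch below is an illustration rather than anything taken from the text: it assumes independent units, each good with probability q, pools of a fixed size k, one pooled test per pool, and k individual retests whenever the pooled test fails, so that the expected number of tests per unit is 1/k + 1 - q^k. The function names are hypothetical.

```python
def pooled_tests_per_unit(q: float, k: int) -> float:
    """Expected tests per unit when pools of size k are tested once and,
    on failure, every unit in the pool is retested individually.
    Assumes independent units, each good with probability q."""
    if k == 1:
        return 1.0  # a pool of one is just an individual test
    return 1.0 / k + (1.0 - q ** k)


def best_pool_size(q: float, k_max: int = 200) -> tuple[int, float]:
    """Return (k, cost) minimizing the expected tests per unit over 1..k_max."""
    costs = {k: pooled_tests_per_unit(q, k) for k in range(1, k_max + 1)}
    k_best = min(costs, key=costs.get)
    return k_best, costs[k_best]


if __name__ == "__main__":
    for q in (0.90, 0.95, 0.99):
        k, cost = best_pool_size(q)
        print(f"q = {q:.2f}: pool size {k}, about {cost:.4f} tests per unit")
```

The rarer the defect (q closer to 1), the larger the best pool and the smaller the cost per unit, which is the dependence on rarity noted above.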

The goal of the problem treated here is the same, namely to separate the defective 
units from the good units with a minimal (or approximately minimal) number of group- 
tests. The main problem treated here is a generalization of that above in the following 
respects: 

i) The population size N (number of people to be tested above) is finite and known at
the outset. The case N = ∞ is briefly discussed in Section 6.

ii) The number of units in each group-test (pooled blood samples above) is not
necessarily constant. Actually this is a consequence of i) since the size of the next
group-test cannot be greater than the number of units not yet classified.

iii) If a group-test fails (at least one defective - or in the context above, at least one
diseased sample - is present) then we do not necessarily test each item separately.

In practice the simplicity of the procedure deserves some consideration. The
proposed procedure R_1 defined in Section 3, after having been computed and described
explicitly in advance of any experimentation, is in some sense no more complicated
than the blood-testing procedure described above; this is explained in Section 5.

Some typical industrial applications are the following; in each case the a priori 
probability q of a unit being good is assumed to be given. 

1. It is desired to remove all "leakers" from a set of N devices. One chemical
apparatus is available and the devices are tested by putting x of them (where 1 ≤ x ≤ N)
in a bell jar and testing whether any of the gas used in constructing the devices has 
leaked out into the bell jar. It is assumed that the presence of gas in the bell jar indi- 
cates only that there is at least one leaker and that the amount of gas gives no indica- 
tion of the number of leakers. 

2. Paper capacitors are tested at most N at a time and each test indicates by the 
presence (or absence) of a current that there is at least one defective (or no defectives) 
present. For given N and given cost of unit manufacture, should the operator throw 
away a whole set of N units if it contains at least one defective? If not, how should he 
proceed to sort out the defective units to minimize the expected number of tests re- 
quired? If the cost of a group-test and the cost of producing a unit are known then a 
related problem is to find a procedure which minimizes the total cost (including test- 
ing costs) of producing a good unit. 

3. Christmas Tree Lighting Problem: A batch of N light bulbs is electrically
arranged in series and tested by applying a voltage across the whole batch or any subset
thereof. If this is to be done on a routine basis, what procedure should be used to
minimize the expected number of tests required to remove all the defective light bulbs,
assuming the value of q is given?

4. A test indicates whether or not there is at least one good unit present in a batch 
of N, without indicating which ones or how many are good. What procedure should be 
used to find all the good units? This problem is dual to those above. For example, if 
the probability that a fuse is faulty is very low, one could test many fuses in series for 
continuity, as in the Christmas tree problem; but if the same probability is very high, 
one could test many fuses in parallel for continuity. The identical analysis applies,
mutatis mutandis.

A procedure, R_1, which describes for each value of q a sequence of tests leading to
the classification of each of the N units as good or defective, is proposed and compared
with several other procedures applicable to the same problem. The procedure R_1 is
simple in the sense that between any two successive tests, i) future tests are concerned
only with units not yet classified as good or defective, ii) units not yet classified have
to be separated into only (at most) two sets, and iii) the units within each set need not
be distinguishable.

1. Based on the assumption that at any stage the next group for testing is taken from
only one of these two sets (and is not formed by mixing together units from both sets),
the procedure R_1 is shown below to be optimal for all values of q.

2. If it is given that the units are never distinguishable from one another (except as
they are indicated to be good or defective) - i.e., that it is impossible or economically
impractical to identify individual units - then the procedure R_1 is conjectured to be
optimal for all q.

3. In some applications the units are linearly arranged in fixed positions and tests
can only be carried out on an "interval" of successive units. The procedure R_1 can be
applied to such problems and it is conjectured to be optimal for all q.

4. Although the procedure R_1 depends on the given finite value of N, a natural
modification R_21 of R_1 can be applied when units continue indefinitely to arrive on an
assembly line basis, i.e., when N = ∞. In this case every unit must be classified in a
finite number of steps. If we consider the limit as T → ∞ of the expectation of the
ratio of the number of units classified in T tests to T as a criterion for efficiency, then
R_21 is conjectured to be optimal for all q. This discussion is amplified in Section 6
below.

5. Finally, it is conjectured that, for q < (1 + √33)/8 ≈ .843, the procedure R_1 is
optimal for any finite N.

For finite N, another procedure R_0 is described which is identical with R_1 for
q < .843 and is an improvement over R_1 for q > .843; however the procedure R_0 is also
more complicated than R_1. The procedure R_0 is optimal for all q if N is very small;
it is not known whether or to what extent this property remains true for intermediate
and large values of N.

Among several other procedures defined for purposes of comparison with R_1, one of
these, called R_2, is simpler to compute explicitly. It is defined in Section 15 in terms
of information theory concepts. Another procedure, R_4, is a "halving procedure"
which starts by testing all the N units and, if a defective is present, carries out the
next test on N/2 or (N - 1)/2 items which are chosen at random. This procedure has
the advantage that it can be carried out without knowing the true value of q.

Several different directions for generalization of the basic problem and
corresponding generalizations of the procedure R_1 are considered in Section 11.
3. The Procedure R_1

The procedure R_1 will be defined implicitly by a pair of recursion formulas and
simple boundary conditions after some definitions and preliminary results. In the
course of experimentation under R_1 the units proven good and the units proven
defective are never¹ used in subsequent tests. Aside from such units, the procedure R_1
requires that between any two successive tests the remaining (or unclassified) units of
number n ≤ N be separated into at most two sets. One set of size m ≥ 0, called the
"defective set," is known to contain at least one defective unit if m ≥ 1 (it is not known
which ones are defective nor exactly how many there are). The other set of size
n - m ≥ 0 is called the "binomial set" because we have no knowledge about it other than
the original binomial assumptions, i.e., given the past history of testing, the a
posteriori distribution associated with these units is that of independent binomial chance
variables with a common probability q of being good. Either of these two sets can be
empty in the course of experimentation; both are empty at termination.

For a defective set of size m ≥ 1 the (conditional) probability that the number of
defective units Y present equals y is

(1)    $$ P\{Y = y \mid Y \ge 1\} = \frac{\binom{m}{y}\, p^{y} q^{m-y}}{1 - q^{m}} \qquad (y = 1, 2, \ldots, m). $$

If Z denotes the number of defectives present in a subset of size x (with 1 ≤ x < m)
randomly chosen from the defective set, then

(2)    $$ P\{Z = 0 \mid Y \ge 1\} = \frac{q^{x} - q^{m}}{1 - q^{m}}. $$

The following simple result, which is a special case of a lemma proved in [4],
plays a fundamental role in the derivation of R_1 below.

¹ In certain classical weighing problems with a known number of defectives, involving
a pan balance, such units are used in subsequent tests.

Lemma 1: Given a defective set of size m > 1, and given that a proper subset of size
x with 1 ≤ x < m also proves to contain at least one defective, then the a posteriori
distribution associated with the m - x remaining units is precisely that of m - x
independent binomial chance variables with the common probability q of being good.

As a result of the above lemma we can test proper subsets of a defective set and be
assured that, regardless of the outcome, we will always have at most one defective set
to work with. This procedure can result in two binomial sets but these can be
combined without any loss of "information"; we shall refer to this process of combining
binomial sets as "recombination."
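
Lemma 1 can be checked by brute force for small sets. The sketch below is an illustration under the stated binomial assumptions, not a proof from [4]: it enumerates every good/defective pattern of a defective set of size m, keeps those patterns in which the first x units contain a defective, and tabulates the conditional distribution of the remaining m - x units. The function name and the 0/1 coding are assumptions of the example.

```python
from itertools import product


def remaining_units_distribution(m: int, x: int, q: float) -> dict:
    """Exact conditional distribution of the last m - x units of a defective
    set of size m, given that the first x units contain a defective.

    Units are coded 1 = good, 0 = defective, independent with P(good) = q.
    Returns {pattern of the remaining units: conditional probability}.
    """
    p = 1.0 - q
    weights = {}
    total = 0.0
    for state in product((0, 1), repeat=m):
        head, tail = state[:x], state[x:]
        if all(head):
            continue  # the tested subset showed no defective: event excluded
        # A defective in the subset already implies a defective in the set.
        prob = 1.0
        for unit in state:
            prob *= q if unit else p
        weights[tail] = weights.get(tail, 0.0) + prob
        total += prob
    return {tail: w / total for tail, w in weights.items()}


if __name__ == "__main__":
    m, x, q = 5, 2, 0.9
    dist = remaining_units_distribution(m, x, q)
    for tail, prob in sorted(dist.items()):
        binomial = 1.0
        for unit in tail:
            binomial *= q if unit else (1.0 - q)
        print(tail, round(prob, 6), round(binomial, 6))
```

The two printed columns agree for every pattern, which is the content of the lemma: once a defective is known to lie in the tested subset, the untested remainder carries no further information and can be returned to the binomial set.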

Let E_1{T; m, n, q} = G_1(m, n) denote the expected number of group-tests
remaining to be performed if the defective set is presently of size m, the binomial set is
presently of size n - m, the a priori probability of a good unit is the known constant q,
and the procedure R_1 is used; for the special case m = 0 we use the symbol
E_1{T; n, q} = H_1(n). The values of m and n vary as the procedure is carried out; at
the outset m = 0 and n = N. It will also be convenient to refer to the "G-situation" or
"G(m, n)-situation" if m ≥ 2 and to the "H-situation" or "H(n)-situation" if m = 0; the
case m = 1 ≤ n is immediately reducible to an H-situation without any testing.
Recursion Formulas Defining Procedure R_1.

If x denotes the size of the very next group-test then we write for any H-situation

(3)    $$ H_1(n) = 1 + \min_{1 \le x \le n} \left\{ q^{x} H_1(n - x) + (1 - q^{x})\, G_1(x, n) \right\} $$

and, for any G-situation, from equation (2) and Lemma 1

(4)    $$ G_1(m, n) = 1 + \min_{1 \le x \le m-1} \left\{ \left( \frac{q^{x} - q^{m}}{1 - q^{m}} \right) G_1(m - x, n - x) + \left( \frac{1 - q^{x}}{1 - q^{m}} \right) G_1(x, n) \right\} $$

and the boundary conditions state that for all q

(5)    $$ H_1(0) = 0 \quad \text{and} \quad G_1(1, n) = H_1(n - 1) \quad \text{for } n = 1, 2, \ldots. $$

The subscripts in H_1(n), etc. refer to the procedure R_1. In writing (3) and (4) the
constant 1 represents the very next group-test of size x and the expression in braces is
the conditional expected number of additional group-tests given x. It follows from (3)
and (5) that H_1(1) = 1 for all q.
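
The recursion formulas (3) and (4) with boundary conditions (5) translate directly into a memoized computation. The following is a minimal sketch for a fixed q with 0 < q < 1, not the tabulation method of [4]; the names make_r1, H and G are hypothetical, and ties between test sizes are broken in favor of the smaller x, a choice the text does not specify.

```python
from functools import lru_cache


def make_r1(q: float):
    """Build H_1(n) and G_1(m, n) for a fixed q from recursions (3)-(5).

    Each function returns (expected remaining tests, minimizing test size x).
    """
    @lru_cache(maxsize=None)
    def H(n: int):
        if n == 0:
            return 0.0, 0          # boundary condition H_1(0) = 0
        best = None
        for x in range(1, n + 1):
            value = 1.0 + q**x * H(n - x)[0] + (1.0 - q**x) * G(x, n)[0]
            if best is None or value < best[0]:
                best = (value, x)
        return best

    @lru_cache(maxsize=None)
    def G(m: int, n: int):
        if m == 1:
            return H(n - 1)[0], 0  # boundary condition G_1(1, n) = H_1(n - 1)
        best = None
        for x in range(1, m):
            success = (q**x - q**m) / (1.0 - q**m)   # subset of size x all good
            failure = (1.0 - q**x) / (1.0 - q**m)    # subset contains a defective
            value = 1.0 + success * G(m - x, n - x)[0] + failure * G(x, n)[0]
            if best is None or value < best[0]:
                best = (value, x)
        return best

    return H, G


if __name__ == "__main__":
    H, G = make_r1(0.98)
    tests, first_x = H(12)
    print(f"H_1(12) = {tests:.2f}, first test size x = {first_x}")
    print(f"test size after a first failure: x = {G(12, 12)[1]}")
```

For n = 2 the recursion reduces to H_1(2) = min(2, 3 - q - q^2), so testing both units together pays only when q^2 + q > 1.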

Remark 1: To justify writing G_1(x, n) in (4) we make use of Lemma 1 with a defective
set of size m ≥ 2 and a subset of size x < m. Then, by Lemma 1, if the subset of size
x is shown to contain at least one defective, the a posteriori distribution associated
with the remaining m - x units is exactly binomial. These are then mixed or
recombined with the n - m "binomial units" giving a total of n - x binomial units.
Remark 2: These two recursion formulas together with the boundary conditions allow
one to compute successively for any q the functions H_1(1), G_1(2, 2), H_1(2), G_1(2, 3),
G_1(3, 3), H_1(3), G_1(2, 4), G_1(3, 4), G_1(4, 4), H_1(4), ... to any desired value of m and n.
Remark 3: The integer x which accomplishes the minimization in (3) and (4) for each
situation characterized by the integers m and n is particularly important since this is
the size of the next test to be run according to the procedure R_1. These integers
x = x_H(n; q) and x = x_G(m, n; q) define the procedure R_1. An illustration of how the
procedure R_1 is to be carried out is given in Section 4. It is pointed out in Section 5 that
x_G(m, n; q) depends only on m and q and a procedure for computing it is given there.
Remark 4: If m ≥ 2 then it is assumed in (4) that a subset of size x with 1 ≤ x < m will
be taken from the defective set without mixing them with units from the binomial set.
It follows from the expressions (3), (4) and (5) that any lack of optimality can only
arise from this "no mixing" assumption. This assumption was used in the derivation
of the algorithm (4) (see Remark 1 above). It will be noted in Section 13 that when all
the units are individually identified then an improvement to the procedure R_1 for high
values of q can be obtained by dropping this assumption. A specific example R_0 of a
modification and improvement of the procedure R_1 which drops the "no mixing"
assumption at the expense of more complication will be briefly discussed in Section 13.
4. Illustration of the Procedure R_1.

Suppose we start with N = 12 units and it is given that q = .98. As indicated in the
tabulations in [4], the first test-group is of size x = 12, i.e., we start by testing all
12 units. If a success occurs the experiment is over; if a failure occurs then, by the
tabulations in [4], the next test group is of size x = 4 chosen at random from the 12.
Similarly we continue along one of the following sample paths.

FIG. 1. PARTIAL TREE FOR PROCEDURE R_1 (N = 12, q = .98)

The complete diagram (or "tree") is not shown here but it continues in a similar
manner and complete details can be obtained from Figures 3 and 4 of [4].

It is obvious that the above procedure terminates in a finite number of steps. In
fact it can be shown for procedure R_1 that the maximum number of tests M_1(n) for any
H-situation (or M_1(m, n) for any G-situation) occurs when q is close to unity and the n
unclassified units all happen to be defective. It follows easily that

(6)    $$ M_1(n) = (n + 1)\,[1 + \alpha(n)] + 1 - 2^{\,1 + \alpha(n)} $$

(7) M(m f n). a +n 1+ - . 



(m>2) 

where α(n) is defined as the positive integer for which
(8)    $$ 2^{\alpha(n)} \le n < 2^{\,1 + \alpha(n)}. $$
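
Formulas (6) and (8) are easy to evaluate mechanically. The sketch below is only an illustration; the function names are hypothetical, and α(n) is computed from the binary length of n, which is one way of satisfying (8).

```python
def alpha(n: int) -> int:
    """Integer alpha(n) with 2**alpha(n) <= n < 2**(1 + alpha(n)), as in (8)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n.bit_length() - 1


def max_tests_h(n: int) -> int:
    """Maximum number of group-tests M_1(n) from an H-situation, formula (6)."""
    a = alpha(n)
    return (n + 1) * (1 + a) + 1 - 2 ** (1 + a)


if __name__ == "__main__":
    print(alpha(12), max_tests_h(12))  # the text quotes 3 and 37 for n = 12
```

Because the bound grows like n log n while the expected number of tests stays small for q near unity, the maximum is of interest mainly as a guarantee of termination.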

For the above example α(12) = 3 and M_1(12) = 37. Although the maximum number is
o large, the probability of this maximum is about I -* f and the expected number of 
tests for q = 0.98 is only 2.07, with standard deviation about 2.1. A table of
probabilities for the number of tests T required with q = .98 and N = 12 is given below:

  T        1       2    3    4    5       6       7       8       9       10-37
  Prob.    .7847   0    0    0    .0801   .1124   .0003   .0016   .0134   .0075
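
The tabulated probabilities can be checked against the quoted mean. The sketch below rests on an assumption about the layout of the table: the three empty cells are taken to belong to T = 2, 3, 4 and to stand for zero, which is the reading under which the entries reproduce the stated expectation of about 2.07.

```python
# Probabilities of the number of tests T as read from the table (N = 12, q = .98).
# Assumption: the three blank cells belong to T = 2, 3, 4 and stand for zero.
table = {1: 0.7847, 5: 0.0801, 6: 0.1124, 7: 0.0003, 8: 0.0016, 9: 0.0134}
tail_10_to_37 = 0.0075  # combined probability of T = 10, ..., 37

total = sum(table.values()) + tail_10_to_37

# Counting every outcome in the last cell as T = 10 gives a lower bound on the
# mean, since the true T in that cell is at least 10.
mean_lower = sum(t * p for t, p in table.items()) + 10 * tail_10_to_37

print(f"total probability   = {total:.4f}")
print(f"lower bound on E(T) = {mean_lower:.4f}")
print(f"P(T = 1) vs q**12   = {table[1]:.4f} vs {0.98 ** 12:.4f}")
```

The first entry is simply the probability q^12 that all twelve units are good, in which case the single test of the whole set ends the experiment.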



It is interesting to note that if the starting number N is exactly a power of 2 and if q
is large then the procedure R_1 starts out, as in [4], exactly the same as a "halving"
procedure. Such a halving procedure R_4 is defined in [4] for any N. It has the property
that it can be carried out without knowing the true value of q. It is shown in [4] that
the procedure R_1 is better than R_4 for q ≥ .844 and the same result also holds for
q < .844. For q ≥ .844, the maximum difference between E_4(T) and E_1(T) occurs at
q = .844 and is equal to .0379 for N = 6, and the procedure R_1 appears to have a
variance smaller than that of R_4 for all q < 1. A comparison of E_1(T) with E(T) for
several other procedures is given in [4].
5. The Simplicity of R_1.

It is pointed out in this section that for any given q and any situation G(m, n) the
appropriate x (i.e., the integer which accomplishes the minimization in (4)) does not
depend on n. A somewhat simpler method of computing x is given and a new function
of m alone is introduced to replace G_1(m, n) in the definition of the procedure R_1. For
any m ≥ 2 and any pair of integers (x, x + 1) both possible under R_1, there is always a
unique value of q, say q_1(x) = q_1(x, x + 1; m), such that x and x + 1 yield the same
minimum value in (4); this value of q separates the interval for x from the interval for
x + 1. (This property was observed for m ≤ n ≤ 16 and is treated as a conjecture for
all m and n in Section 8.)

According to Remark 4 in Section 3 the procedure R_1 for m > 1 is to "break down"
the defective set. This "breaking down" is continued until a single unit is established
to be defective and removed. It will be convenient to assume, without affecting the
properties of the procedure R_1, that the order is randomized only once at the outset.¹
Units or groups of units removed later will then be taken in that order. If the i-th unit
in that order is the first defective unit, then the "breaking down" mentioned above
leads to an H-situation with n - i binomial units and the converse also holds
true.² It is convenient to introduce F_1(m, q) = F_1(m) defined as the expected number
of group-tests required to "break down" a defective set of size m and for the first
time reach an H-situation when q is given. Then F_1(m) clearly does not depend on n
and the above argument permits us to write

(9)    $$ G_1(m, n) = F_1(m) + \frac{p}{1 - q^{m}} \sum_{i=1}^{m} q^{\,i-1} H_1(n - i). $$

¹ Even this single randomization can be disregarded in carrying out the procedure R_1
if there is no doubt about the assumption of independent chance variables or if the units
are already well mixed in the process of delivery to the experimenter.

² It follows from the above that for any procedure which "breaks down" the defective
set in the above manner (including the method of testing units from the defective set
one at a time until a defective unit is found) the expected number of good units
eliminated between a G(m, n)-situation and the next H-situation is q/p - m q^m/(1 - q^m)
and the number of defective units eliminated is always exactly one.

For algebraic simplicity we let

(10)   $$ G_1^*(m, n) = \frac{1 - q^{m}}{p}\, G_1(m, n) \quad \text{and} \quad F_1^*(m) = \frac{1 - q^{m}}{p}\, F_1(m). $$

Then (4) and (9) take on the simpler forms

(11)   $$ G_1^*(m, n) = \sum_{i=1}^{m} q^{\,i-1} + \min_{1 \le x \le m-1} \left\{ q^{x} G_1^*(m - x, n - x) + G_1^*(x, n) \right\} $$

(12)   $$ G_1^*(m, n) = F_1^*(m) + \sum_{i=1}^{m} q^{\,i-1} H_1(n - i). $$

Substituting (12) in (11), the three summations cancel and the result is

(13)   $$ F_1^*(m) = \sum_{i=1}^{m} q^{\,i-1} + \min_{1 \le x \le m-1} \left\{ q^{x} F_1^*(m - x) + F_1^*(x) \right\} $$

which does not depend on n. The boundary condition, F_1^*(1) = 0 for all q, also does not
depend on n. It is clear from this derivation that (13), which does not depend on n,
must define the same integer values x = x_G(m; q) as (11) or (4). This proves the
following theorem.

Theorem: For any G-situation and any q the size of the next test group, defined
implicitly by (4), does not depend on n.

This result simplifies the explicit instructions needed to describe the procedure.
Thus the two diagrams, Figures 3 and 4 of [4], describe the procedure R_1 for all
values of q and for any N ≤ 16.

Equations (9) and (10) can also be substituted in (3) yielding

(14)   $$ H_1(n) = 1 + \min_{1 \le x \le n} \left\{ q^{x} H_1(n - x) + (1 - q) \left[ F_1^*(x) + \sum_{i=1}^{x} q^{\,i-1} H_1(n - i) \right] \right\} $$

which together with (13) gives a pair of "one-dimensional" recursion formulas for
defining R_1 instead of the "two-dimensional" set, (3) and (4).

Remark 5: If one were to ask for a procedure that "breaks down" the defective set in
as small an expected number of group tests as possible then one would write (13) as
one of the basic recursion formulas defining the procedure. This shows that R_1
"breaks down" the defective set and returns to an H-situation in a minimal expected
number of tests.
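
The "one-dimensional" pair (13) and (14) can be tabulated with two simple loops. The sketch below is an illustration for a fixed q with 0 < q < 1; the function name and the returned lists are hypothetical, and ties are broken in favor of the smaller x, which the text does not specify.

```python
def r1_one_dimensional(q: float, n_max: int):
    """Tabulate F*_1(m) from (13) and H_1(n) from (14) for m, n = 0..n_max.

    Returns (F_star, H, x_G, x_H); x_G[m] and x_H[n] are the minimizing sizes.
    """
    p = 1.0 - q
    F_star = [0.0] * (n_max + 1)   # F*_1(1) = 0; index 0 is unused
    x_G = [0] * (n_max + 1)
    for m in range(2, n_max + 1):
        geometric = sum(q ** (i - 1) for i in range(1, m + 1))
        value, x = min(
            (q ** x * F_star[m - x] + F_star[x], x) for x in range(1, m)
        )
        F_star[m] = geometric + value
        x_G[m] = x

    H = [0.0] * (n_max + 1)        # H_1(0) = 0
    x_H = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        candidates = []
        for x in range(1, n + 1):
            tail = sum(q ** (i - 1) * H[n - i] for i in range(1, x + 1))
            candidates.append(
                (1.0 + q ** x * H[n - x] + p * (F_star[x] + tail), x)
            )
        H[n], x_H[n] = min(candidates)
    return F_star, H, x_G, x_H


if __name__ == "__main__":
    F_star, H, x_G, x_H = r1_one_dimensional(0.98, 12)
    print(f"H_1(12) = {H[12]:.2f}, first test size x = {x_H[12]}")
```

Since x_G depends on m alone, a single list indexed by m describes the whole G-situation part of the procedure for a given q, which is the simplification stated in the theorem.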
6. The Case N = ∞.

In this case the model states that the binomial set remains infinite throughout
experimentation but the defective sets and sample sizes will always be finite. Instead of
assuming that all units have to be classified as in the case of finite N, we shall now
restrict our attention to procedures with the property that the total number of units in
all unclassified sets, known to contain any defective units, cannot become indefinitely
large. For example, a procedure that always disregards defective sets and continues
to take the next test group from the infinite binomial set is eliminated from our
present discussion. It will be assumed that the population is denumerably infinite and that
the population has been arranged in an ordered sequence (u_1, u_2, ...). Some
procedures, including those proposed below, have the stronger property that u_j is never
classified later than u_{j+1} (j = 1, 2, ...); this can be referred to as the "first come, first
served" property. The criterion which seems most appealing for comparing
procedures when N = ∞ is to consider the limit

(15)   $$ C'(q; R) = \lim_{T \to \infty} E\left\{ \frac{\text{Number of units classified in } T \text{ tests}}{T} \;\Big|\; R \right\} $$

and one procedure R' is then considered to be better than another R'' over some range
of q-values (which we take to be the open interval from 0 to 1) if C'(q; R') ≥ C'(q; R'')
for all q, with strict inequality for at least one value of q. An optimum procedure is
one which maximizes (15) for all q, subject to the above mentioned restriction that the
total number of units in all unclassified sets, known to contain any defective units,
cannot become indefinitely large. We restrict our attention to values of C'(q; R) on the open
interval 0 < q < 1, and to procedures R for which the right-hand member of (15) is
greater than c for some positive c and all q.

It can be shown (at least for the procedures proposed below) that C'(q; R) is the
reciprocal of

(16)   $$ C(q; R) = \lim_{n \to \infty} E\left\{ \frac{\text{Number of tests to classify } n \text{ units}}{n} \;\Big|\; R \right\} $$

(in which the binomial set always remains infinite), and also that

(17)   $$ C(q; R) = \lim_{N \to \infty} E\left\{ \frac{T}{N} \;\Big|\; R(N) \right\} $$

where the right-hand member of (17) deals with a finite population of size N, and R(N)
is a natural modification of R for finite populations. In fact, R and R(N) can differ
only in the H-situation; thus, if R calls for x units, then R(N) will use the smaller of
x and the number of "binomial units" remaining. The proof that C'(q; R) is the
reciprocal of C(q; R) depends on the fact that the expressions in braces in (15) and (16) can
be written as the ratio of means with independent, bounded, and identically distributed
summands. Since both means converge with probability 1, we can drop the expectation
sign in (15) and (16) without altering the value. Furthermore, the value of C'(q; R) is
given by the ratio of expectations

(18)   $$ C'(q; R) = \frac{E\{\text{Number of units classified between successive H-situations} \mid R\}}{E\{\text{Number of group-tests between successive H-situations} \mid R\}} $$

and C(q; R) is the reciprocal of this. The proofs of these results will be published
separately.

It follows from the above that if a procedure is asymptotically optimal (i.e.,
optimal for very large N) in the sense that it maximizes (15) for all q then it is also
asymptotically optimal in the sense that it minimizes (16) for all q and vice versa.

A natural modification of R_1 for the case N = ∞ will now be defined. In any
G-situation, the rule R_1 can be used without change since it was shown above that the size of
the next group test depended only on the size of the defective set. In any H-situation,
the discussion in [4] is used as the basis of a conjecture that x_H(q; R_1) → x_H(q; R_2)
for any q as N → ∞. The procedure R_2 is defined in Section 15. Since x_H(q; R_2) does
not depend on N (assuming N is large), it is natural to use this value for the case
N = ∞. Let us denote this procedure by R_21, since it uses R_2 in the H-situation and
R_1 in the G-situation. Of course, R_2 can also be used for N = ∞ in both G- and
H-situations and we denote this procedure by R_22, to make the notation consistent.
Finally, two more procedures R_01 and R_04 can be defined as follows. In the G-situation
we use R_1 and R_4, respectively. For the H-situation we compute the reciprocal of
(18) for any arbitrary integer x and then define x_H(q; R_0j) (j = 1, 4) to be that integer
which minimizes the resulting expression, i.e., x_H(q; R_0j) is the integer for which the
minimum

(19)   $$ C(q; R_{0j}) = \min_{x = 1, 2, \ldots} \frac{p\,[\,1 + p\, F_j^*(x; q)\,]}{1 - q^{x}} $$

is attained. The derivation of the right-hand member of (19) is given in [4].

TABLE I

Values of C(q;R), C'(q;R) and x_H(q;R), Respectively, for Four Procedures at Three
Values of q.

   q              R_01       R_21       R_04       R_22

  .90    C        .47251     .47251     .47251     .47251
         C'       2.116      2.116      2.116      2.116
         x_H      7          7          7          7

  .95    C        .28808     .28808     .28849     .28853
         C'       3.471      3.471      3.466      3.466
         x_H      14         14         15         14

  .99    C        .08105     .08105     .08107     .08126
         C'       12.338     12.338     12.335     12.306
         x_H      69         69         65         69


Some numerical results for these procedures are given in Table I for q = .90, .95
and .99. One interesting result is that R_{21} appears to be identical with R_{01}. The
author has proved this equivalence in general under the assumption that, for R_{01}, the
right member of (19) has a unique minimum (or that the integer x at which the minimum
in (19) is attained is such that F^*(x;q) is given by (23) below); but the assumption
of a unique minimum has not been proved and hence the general result must be
regarded as a conjecture. In view of the remark in Section 5 that in a G-situation the
procedure R_1 returns to an H-situation in a minimal expected number of tests, we
have some grounds for conjecturing that R_{21} is an optimal procedure in the sense of
(15) or (16) for the case N = \infty, among the class of procedures for which the total
number of units in all unclassified sets known to contain any defectives remains
bounded. For values of q close to unity, the results for R_{04} are very close to the
results for R_{21} (or R_{01}) but the values of x_H(q, R_{04}) have not yet been computed for as
many values of q as have the values x_H(q, R_{21}). Furthermore, the procedure R_{04} does
not possess the property (that R_4 possesses) that it can be carried out without knowing
the true value of q.

Application: In some problems, units come off an assembly line (or a conveyor belt)
and the size of the population is conceptually infinite, although the number of units
available may actually be quite small at any given time. The rate at which units become
available is assumed to be matched to the average rate of testing, so that the
experimenter always has enough units to carry out any particular procedure. If every
unit that comes off the assembly line has to be classified and the number of such units
is not known at the outset, then the infinite model, i.e., the model with N = \infty, is
appropriate.
7. Properties of R_1 for q Close to Unity.

Returning to the case of a finite population size, it is shown in [4] that for a
G(m, n)-situation with q in an interval ending at unity (i.e., q(m) < q < 1), the size
x = x_G(m; q, R_1) of the next group test is such that, depending on m, either x is a
power of 2 or m - x is a power of 2. A more precise statement of this result requires
some definitions. Let x_{max} = \max_q x_G(m;q) denote the largest value of x assigned by R_1
in a G(m, n)-situation as q varies in the open interval 0 < q < 1. It has been numerically
shown for m \le n \le 16 (and is conjectured for all m, n) that, under R_1, the integer
x_{max} occurs in an interval of q-values ending at unity. For m \ge 2, let the integers
\alpha(m), \beta(m) be defined by

(20)  m = 2^{\alpha(m)} + \beta(m)    (0 \le \beta(m) < 2^{\alpha(m)}),

which is consistent with (8). Then the above-mentioned result states that, under R_1,
in any G(m, n)-situation, there is an interval of q-values ending at unity in which the
value of x is given by

(21)  x = x_{max} = 2^{\alpha(m) - 1}    for m \le 3 \cdot 2^{\alpha(m) - 1},
      x = x_{max} = m - 2^{\alpha(m)}    for m \ge 3 \cdot 2^{\alpha(m) - 1}.

As a corollary, it follows that for any G(m, n)-situation, under R_1,

(22)  m/3 \le x_{max} \le m/2.

Also, in the above-mentioned interval, ending at unity,

(23)  F_1^*(m) = \alpha(m) \frac{1 - q^m}{1 - q} + q^{m - 2\beta(m)} \frac{1 - q^{2\beta(m)}}{1 - q},

(24)  H_1(n) = q + np + pF_1^*(n) + p^2 \sum_{j=2}^{n-1} F_1^*(j),

and expressions for G_1^*(m, n) in terms of F_1^*(j) for j \le n are given in [4]. If we let
q approach unity in these results then we obtain

(25)  \lim H_1(n) = 1

(26)  \lim F_1^*(m) = m\alpha(m) + 2\beta(m)

(27)  \lim G_1^*(m, n) = m[1 + \alpha(m)] + 2\beta(m)        for n > m,
      \lim G_1^*(m, n) = m[1 + \alpha(m)] + 2\beta(m) - 1    for n = m.

If the sign of the slope of these functions (at least for q close to unity) can be determined
then the above expressions furnish rough bounds for large values of q. For example,
it is easy to show that F_1^*(m) is continuous and strictly increasing for all q so
that (26) furnishes an upper bound for all q.
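
The group-size rule of this section is easy to check numerically. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the reading of (20)-(22) in which x = 2^{\alpha(m)-1} for m \le 3 \cdot 2^{\alpha(m)-1} and x = m - 2^{\alpha(m)} otherwise, together with the limiting value m\alpha(m) + 2\beta(m) of (26).

```python
def alpha_beta(m):
    """Return (alpha, beta) with m = 2**alpha + beta and 0 <= beta < 2**alpha."""
    if m < 2:
        raise ValueError("m must be at least 2")
    alpha = m.bit_length() - 1          # largest power of 2 not exceeding m
    return alpha, m - 2 ** alpha


def x_near_unity(m):
    """Assumed size of the next group test in a G(m, n)-situation for q near 1."""
    alpha, _ = alpha_beta(m)
    half = 2 ** (alpha - 1)
    return half if m <= 3 * half else m - 2 ** alpha


def f_star_limit(m):
    """Limiting value m*alpha(m) + 2*beta(m), as in (26)."""
    alpha, beta = alpha_beta(m)
    return m * alpha + 2 * beta


def is_power_of_two(k):
    """True for 1, 2, 4, 8, ..."""
    return k > 0 and (k & (k - 1)) == 0


if __name__ == "__main__":
    for m in range(2, 17):
        print(m, alpha_beta(m), x_near_unity(m), f_star_limit(m))
```

For every m the value returned by x_near_unity lies between m/3 and m/2, and either x or m - x is a power of 2, which is the content of the corollary (22) and of the remark at the start of this section.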
8. Conjectured Properties of R_1.

In this section we state some properties which appear to hold for Procedure R_1,
based on numerical calculations for N \le 16, but which have not been proved for all N.

A. For any G-situation with fixed m > 1, if x_G(m;q) denotes the size of the next test
group under R_1 then x_G(m;q) is a non-decreasing step function of q with step size unity,
i.e., for any q < q + \epsilon < 1 (\epsilon > 0),

(28)  x_G(m;q) \le x_G(m;q + \epsilon)

and for sufficiently small \epsilon

(29)  x_G(m;q + \epsilon) \le x_G(m;q) + 1.

Also for fixed q the value of x_G(m + 1;q) is either the same or one greater than
x_G(m;q), i.e.,

(30)  x_G(m;q) \le x_G(m + 1;q) \le x_G(m;q) + 1.

The assumption used in Section 7 that the largest x-values are associated with the
largest q-values is a simple consequence of (28).

B. For any H-situation we can define x_H(n;q) similar to x_G(m;q) and the property corresponding
to (28) still holds; namely that for q < q + \epsilon < 1 and all positive integers n

(31)  x_H(n;q) \le x_H(n;q + \epsilon).

It is clear from the tabulations in [4] that the analogs of (29) and (30) do not hold for
x_H(n;q).

C. Assume it to be given as part of the problem that it is impossible or economically
impractical to identify or keep separate the individual units within any set formed in
the course of carrying out procedure R_1. Then the past history of individual units is
lost and it can be assumed that, after each test on a batch of x units, the disposition
of the x units is made on a batch basis. In this problem it is conjectured that the procedure
R_1 is the optimal procedure for all values of q.
9. Properties of R_1 for q Close to Zero.

The procedure R_1 has the interesting property that, for q < q_0 = (1/2)(\sqrt{5} - 1) \approx .618,
for all integers m, n (2 \le m \le n), the units are all tested one at a time. This same property
was recently shown by Ungar [6] to hold for the optimal procedure for any N
(without specifying what the optimal procedure is like for any q > q_0). The same property
also holds for the "information" procedure R_2, for the procedure R_ defined in
[4], and for the "mixing" procedure R_0 defined in Section 13 below. It also holds for
several generalizations discussed in Section 11.

The above result for R_1 will now be stated formally as a theorem.
Theorem 2: Under procedure R_1 with 2 \le m \le n and 0 \le q < q_0,

(32)  x_G(m;q) = x_H(n;q) = 1

(33)  H_1(n) = n

(34)  G_1(m, n) = n - \frac{pq^{m-1}}{1 - q^m}

(35)  F_1(m) = \frac{1}{p} - \frac{q^{m-1}[1 + (m - 1)q]}{1 - q^m}

The proof is given in Section VIII of [4].
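
The one-at-a-time property can be illustrated numerically. The Python sketch below is not the tabulation of [4]; it is a minimal dynamic program under stated assumptions. The recursion for R_1 is not reproduced in this section, so the code assumes the usual nested form (minimize over the next group size x, with boundary conditions H(0) = 0 and G(1, n) = H(n - 1)), and the names make_r1, H and G are hypothetical.

```python
from functools import lru_cache


def make_r1(q):
    """Return (H, G), expected numbers of group tests under the assumed recursion.

    Assumed form (an assumption of this sketch, not quoted from the text):
      H(0) = 0,  G(1, n) = H(n - 1),
      H(n)    = 1 + min over 1 <= x <= n   of q^x H(n-x) + (1 - q^x) G(x, n),
      G(m, n) = 1 + min over 1 <= x <= m-1 of
                (q^x - q^m)/(1 - q^m) G(m-x, n-x) + (1 - q^x)/(1 - q^m) G(x, n).
    """

    @lru_cache(maxsize=None)
    def H(n):
        if n == 0:
            return 0.0
        return 1.0 + min(
            q ** x * H(n - x) + (1 - q ** x) * G(x, n) for x in range(1, n + 1)
        )

    @lru_cache(maxsize=None)
    def G(m, n):
        if m == 1:                      # the single unit is known to be defective
            return H(n - 1)
        denom = 1 - q ** m
        return 1.0 + min(
            (q ** x - q ** m) / denom * G(m - x, n - x)
            + (1 - q ** x) / denom * G(x, n)
            for x in range(1, m)
        )

    return H, G


if __name__ == "__main__":
    H, G = make_r1(0.5)                 # q below (sqrt(5) - 1)/2
    print([H(n) for n in range(1, 7)])  # individual testing: H(n) = n
    print(G(3, 5), 5 - 0.5 * 0.5 ** 2 / (1 - 0.5 ** 3))
```

With q = 0.5 the program returns H(n) = n, and G(3, 5) agrees with n - pq^{m-1}/(1 - q^m), the value obtained when the last unit of a defective set is inferred to be defective without a test.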
10. A Suggested Procedure for the Case of Unknown q. 

It is reasonable to expect that a knowledge of good procedures for the case of known
q will suggest good procedures for the case of unknown q. From this point of view we
consider modifications of the basic procedure R_1 which make it adaptable when q is
unknown. It is suggested that after each test we form a new estimate of q and that the
procedure R_1 be used with the estimated value in place of the true value. At the outset
we can start with an estimate based on past experience or we can start by testing
one unit at a time. A thorough investigation of the relative merit of this procedure has
not been carried out. Some discussion of the maximum likelihood method of estimating
q is given below.

Let d and s denote the number of units proven defective and proven good, respectively,
so that at any stage of experimentation we have

(36)  N = d + s + m + (n - m) = d + s + n.

The likelihood L of the observed result (36) is given by

(37)  L = p^d q^s (1 - q^m).

Then it is easily shown that

(38)  \frac{d}{dq}(\log L) = -\frac{d}{dp}(\log L) = -\frac{1}{pq} \left\{ d - (N - n)p + \frac{mpq^m}{1 - q^m} \right\}.

Setting the latter equal to zero, we find that for m \ge 2 the maximum likelihood estimate
\hat{q} of q is a real positive root of the m-th degree polynomial equation

(39)  s - d \sum_{i=1}^{m} \hat{q}^i - (m + s)\hat{q}^m = 0

and for m = 0 we have \hat{q} = s/(d + s), the usual estimate. For s = 0 and m + d \ge 1 we
get \hat{q} = 0 and for s \ge 1 it is easily seen, using Descartes' "Rule of Signs," that (39)
has exactly one real positive root \hat{q} which must lie in the unit interval and hence \hat{q} is
uniquely defined. The remaining case s = m = d = 0 can only occur at the outset when
there is no observation on which to base an estimate. It is interesting to note that the
same result (39) can also be obtained by computing the conditional expected proportion
of defectives among the N units given the observed s, d, m, and n and setting it equal to
1 - \hat{q}. The equation thus obtained is the same as (39) and its root is \hat{q}.
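
Since the estimate is defined only implicitly as the root of a polynomial, a short numerical sketch may be helpful. The Python code below is illustrative and rests on an assumption: it takes the likelihood to be proportional to p^d q^s (1 - q^m), so that the estimating equation is s - d(q + q^2 + ... + q^m) - (m + s)q^m = 0, and it finds the root in the unit interval by bisection. The function name mle_q is hypothetical.

```python
def mle_q(d, s, m, tol=1e-12):
    """Maximum likelihood estimate of q from d proven defective units,
    s proven good units and a current defective set of size m (m = 0 or m >= 2).

    Assumes likelihood proportional to p**d * q**s * (1 - q**m).
    """
    if s == 0:
        if d + m == 0:
            raise ValueError("no observations on which to base an estimate")
        return 0.0
    if m == 0:
        return s / (d + s)              # the usual binomial estimate

    def f(q):
        # Left member of the estimating equation; strictly decreasing on [0, 1],
        # with f(0) = s > 0 and f(1) = -m * (d + 1) < 0.
        return s - d * sum(q ** i for i in range(1, m + 1)) - (m + s) * q ** m

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(mle_q(d=3, s=7, m=0))   # no defective set: s / (d + s)
    print(mle_q(d=0, s=2, m=2))   # root of 2 - 4 q^2 = 0
    print(mle_q(d=1, s=20, m=4))
```

Bisection is chosen here only because the left member changes sign exactly once on the unit interval; any bracketing root finder would serve equally well.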

It may be desirable to test several units one at a time at the outset until a stable 
estimate of q can be obtained. If the first estimate of q is based on past experience 
then it is desirable that past experience together with the past observations should en- 
ter into the second, third and other early estimates of q. Otherwise, we may obtain 
sudden jumps from small test groups to large test groups and vice versa, both of 
which are undesirable. 

The above method of getting an estimate is being suggested in connection with pro- 
cedure R, but It can also be used in connection with procedure R^ without any change. 
For procedure R S of [4] we can have several defective sets and several binomials at
any one time and a generalization of (39) is given in [4] for this case. A discussion
of the asymptotic variance is also given in Section IX of [4].
11. Some Generalizations of R_1.



Returning to the case of known probabilities q, we now mention some different generalizations
of the same basic problem. The same method of deriving a recursion
formula with boundary conditions is applicable to most of these problems and, in some
cases, the details can be found in Section XI of [4].

1. Two (or more) different kinds of units with known probabilities (say, q_1 < q_2) of
a good unit are present and both can be put into the same test group. In this case it
turns out that units are tested one at a time in both G- and H-situations if

      1 - q_2 - q_1 q_2 > 0.

2. Two (or more) experimenters are working on a single set of N units by carrying
out simultaneous, parallel group tests (each of which takes the same fixed time) and
cooperating so as to minimize the time required to complete the classification. It is
shown in [4], for N = 4 with two experimenters, that with cooperation the expected
time can be made smaller than if each experimenter were given two units and told to
work independently of the other. This improvement is at the expense of a slight increase
in the expected total number of tests.

3. The basic problem is to be carried out under the added restriction that no one
unit can be included in more than k group tests. This is particularly appropriate in
the blood testing application where a single blood sample can be divided into k equal
portions (one for each test) and the patient does not want to be annoyed by having more
than one blood sample taken. In this problem it is necessary to work with vectors; for
example, \mathbf{m} = (m_0, m_1, \ldots, m_{k-1}) is used to denote the entire defective set and m_j
denotes the number of units in \mathbf{m} that have already been included in j tests (j = 0, 1, \ldots,
k - 1). Recursion formulas, with vector arguments, are given in [4]. It appears
that, in this case also, units are tested one at a time if q < q_0 \approx .618. The case
k = 2 is necessarily based on the method of Dorfman [1], i.e., if a group-test fails
then the units therein are all tested individually. The case k = 3 has been numerically
computed for small values of N and for all q; this will be published separately.

4. Various generalizations appear if it is assumed that each test on x units gives
three (or more) different possible results. For example, a test could indicate that
either i) all are good or ii) all are defective or iii) there is at least one good unit and
at least one defective unit present.

5. If a unit can be defective in either of two ways (say, electrical or mechanical) 
with a priori probabilities of being defective independent but not necessarily equal and 
if there are two different tests corresponding to the two types of defectives then in 
addition to deciding the next test group size it may also be necessary to decide which 
test to use next. 

6. For positive continuous chance variables (like weight) the following problem is 
analogous. A bag of N coins contains good coins of constant weight, say unity, and 
faulty coins whose weights independently follow a known distribution, which allows all 
values greater than unity and no values less than unity. Any number of coins can be 
employed in a single weighing. The problem is to find a procedure for classifying 
each of the coins in a minimal expected number of weighings, assuming that each coin 
has a known a priori probability q of being good and p = 1 - q of being defective.

Many of these generalizations have not yet been fully investigated. 
12. Bounds from Information Theory and Coding Theory.

A lower bound for H(n)/n = E{T | q, n, R}/n for any procedure R, which depends on q
but not on n, can be readily obtained from information theory. Thus the entropy (or
information, measured in bits) associated with the classification of n independent
binomially-distributed observations with parameter q (or p = 1 - q) is given by

(40)  I(n;q) = -\sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i} \log_2 (p^i q^{n-i}) = -n[p \log_2 p + q \log_2 q].

This entropy must be equal to the expectation (with respect to the chance variable T)
of the entropy associated with a succession of T group tests, which always terminates
when (and only when) all the units are classified. Since the entropy associated with
each test is at most unity, we have for any procedure R and any q

(41)  H(n) \ge -n[p \log_2 p + q \log_2 q]

and dividing both sides by n gives the desired result. In particular, for procedure R_1
and q = .99 this gives .08079 as a lower bound for H_1(n)/n for all n. It follows from Table II that
the difference .08320 - .08079 = .00241 is an upper bound on the difference between
H_1(100)/100 and the optimal value of H(n)/n attainable under any procedure; furthermore,
if n \to \infty, this upper bound decreases with n and, by Table I, it approaches
.08105 - .08079 = .00026. This shows that the lower bound (41) is not attainable under
procedure R_1 for any n, at least for q = .99; it is conjectured that it is not attainable
under any group-testing procedure for any q, except q = 1/2. In particular, for q approaching
zero or one and fixed n, the lower bound (41) approaches zero while, for
the best procedure R, H(n;R) \to n for q \to 0 and H(n;R) \to 1 for q \to 1. Although the
lower bound (41) is not generally attainable, it gives a close numerical approximation
to H(n) for large n, provided that q > 1/2. The right-hand member of (41) will be referred
to as the "Information Lower Bound" or ILB.

As a result of the construction in Section 14 below, it can also be shown for the procedure
R_1 that

(42)  H_1(n) \le -n[p \log_2 p + q \log_2 q] + np

so that, using (41), H_1(n)/n differs from the ILB/n by at most p. This is not as strong
as the numerical results above for p = .01, but it is simple, general and does not require
any computation.

Lower bounds for F(m) and G(m,n), corresponding to (41), which hold for any pro- 
cedure R and any values of m, n and q, are 



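(44)  G(m, n) \ge -\sum_{i=1}^{m} \frac{q^{i-1}p}{1 - q^m} \left[ \log_2 \frac{q^{i-1}p}{1 - q^m} + (n - i)(p \log_2 p + q \log_2 q) \right]

              = \frac{(1 - q^m) \log_2 (1 - q^m) + mq^m \log_2 q}{1 - q^m} - \left( n + \frac{mq^m}{1 - q^m} \right) (p \log_2 p + q \log_2 q).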

It is also possible to obtain a better lower bound for each q by the application of a
result due to Huffman [3] in coding theory. Starting with n binomial units, we can regard
them as ordered. Since each unit is good or defective, there are 2^n possible
states of nature, one of which is true. If we represent each test that succeeds by the
digit 'zero' and each test that fails by the digit 'one,' then a procedure (for any fixed
q) is identical with a binary code. Moreover, a particular set of test outcomes (or a
particular stopping point) corresponds in a one-to-one manner with a particular
"word" of the code. (In all procedures of interest, the number of stopping points is
exactly 2^n, one for each state of nature.) Then the expected number of tests required
is identical with the expected word length (i.e., the cost) of the code. Huffman [3]
gives a routine for finding the code with the smallest cost. Starting with 2^n states of
nature with known probabilities, which for our problem are (q^n, pq^{n-1}, pq^{n-1}, \ldots, p^n), we
can construct the optimal code or at least find its cost. This optimal code may or may
not correspond to a group-testing procedure but its cost will be a lower bound to the
expected number of tests required for any group-testing procedure. Unfortunately,
there is no simple analytic expression known to the author for this cost, only a routine
for its computation, which is very time-consuming for large n even on
modern electronic calculators. Some numerical values of this lower bound (which we
denote by HB) are given in Table II for q = .90, .95, .99 and small values of n.

To explain the computation, let Q_i (i = 1, 2, \ldots, I = 2^n) denote any set of a priori
probabilities that sum to unity. Order the Q_i, add the two smallest, reorder the remaining
set of I - 1 probabilities, add the two smallest, reorder the remaining set of
I - 2 probabilities, etc. Let S_j denote the sum of the two smallest probabilities at the
j-th step (j = 1, 2, \ldots, I - 1), so that S_{I-1} = 1. Then the Huffman lower bound (HB),
which depends on q and n, is given by

(45)  HB = \sum_{j=1}^{I-1} S_j.

In every case, this appears to be a greater lower bound than the ILB; Table II shows
that the improvement is best for values of q very close to unity.
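
The routine just described can be sketched in a few lines of Python. The code is illustrative rather than a reproduction of the computations behind Table II: the function names are hypothetical, a binary heap stands in for the repeated reordering, and the information bound is taken to be -n[p \log_2 p + q \log_2 q] as in (41).

```python
import heapq
import math
from math import comb


def huffman_bound(n, q):
    """Sum of the successive pair sums S_j for the 2**n states of nature."""
    p = 1 - q
    probs = []
    for k in range(n + 1):              # k defectives among the n units
        probs.extend([p ** k * q ** (n - k)] * comb(n, k))
    heapq.heapify(probs)                # keeps the smallest probabilities on top
    total = 0.0
    while len(probs) > 1:
        s = heapq.heappop(probs) + heapq.heappop(probs)
        total += s                      # S_j, the sum of the two smallest
        heapq.heappush(probs, s)
    return total


def information_bound(n, q):
    """Entropy in bits of n independent units, for 0 < q < 1."""
    p = 1 - q
    return -n * (p * math.log2(p) + q * math.log2(q))


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, round(huffman_bound(n, 0.95), 5), round(information_bound(n, 0.95), 5))
```

The list of probabilities has 2^n entries, so the sketch is practical only for small n, which is consistent with the remark above that the computation is time-consuming for large n.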
13. On the Lack of Optimality of R_1.

In [4] it is shown that for q close to unity, starting with a finite binomial set, it is
possible to define procedures that are better than R_1. Such procedures will necessarily
involve "mixing", i.e., at least one group test is performed on a mixed set of
units, some of which are taken from a binomial set and the rest from a defective set,
with neither subset empty. A particular procedure R_0, which allows a limited amount
of mixing, will now be defined. No mixing procedure has been found which is better
than R_0 even for a single value of q. It is known that R_0 is optimal for very small
values of n (like n = 2, 3 and 4) but it is not known whether or to what extent it continues
to be optimal for intermediate and large values of n. Some numerical results
for R_0 are given in Table II, in comparison with the results for R_1 and the lower
bounds discussed in Section 12.

FIGURE 2: Mixing Routines for Procedure R_0 [Use only for m = 2 or 3 and q > q_0(m, n)]
(Three tree diagrams, indexed by test number: Mixing Routine for G_0(2, n) [m = 2, n > 2,
q > q_0(2, n)]; Mixing Routine for G_0(3, n) [m = 3, n > 3, q > q_0(3, n)]; Mixing Routine
for G_0(3, 3) [m = n = 3, q > .843].)

Procedure R_0: Mixing is allowed only when the size of the defective set m is two or
three. Three cases are considered according as n > m = 2, n > m = 3 or n = m = 3;
the case n = m = 2 and all other cases with m > 3 are treated in a manner similar to
that of R_1, without mixing. In each of the three cases, we define a short "routine", at
the end of which (if the test hasn't already terminated) the a posteriori distribution is
exactly the same as in a G(m, n)-situation with m = n. Let a_1, a_2, \ldots, a_m denote the
defective set and let b_{m+1}, b_{m+2}, \ldots, b_n denote the binomial set at some stage of
experimentation just before the mixing routine is applied. The three diagrams (or
trees) in Figure 2 will explain in detail the three mixing routines, which are used only
for q sufficiently large, i.e., for q > q_0(m, n). In each case the value of q_0(m, n) is
never less than .843.

Let G_0''(m, n) denote the expected number of (additional) group-tests required to terminate
experimentation if we start at the beginning of one of the above mixing routines.
Then for the three cases n > m = 2, n > m = 3, n = m = 3, respectively, assuming in
each case that G_0(m, n) has already been defined for smaller values of n, we obtain

(46) G (2,n) - 5Eli + Pq(l-1- 2 ) r 3 + G (n- 2. n- 2)] 

1- 1- ^ "^ 



1 - q 2 

(47) G (3.n) -iS4 + P^-S'" 3 ) [5 + G (n-3. n-3)] 

3 3 U J 



1-q 



(48, 

1-q 

Let G_0'(m, n) for n \ge m \ge 2 be defined (without mixing on the next step but possibly
with mixing on subsequent steps) by

(49)  G_0'(m, n) = 1 + \min_{1 \le x \le m-1} \left\{ \frac{q^x - q^m}{1 - q^m} G_0(m - x, n - x) + \frac{1 - q^x}{1 - q^m} G_0(x, n) \right\}

assuming G_0(m, n) has already been defined for smaller n values and also, if the n
values are the same, for smaller m values. The recursion formulas, which define the
procedure R_0, are given for all q in terms of (46), (47), (48) and (49) by
(50)  H_0(n) = 1 + \min_{1 \le x \le n} \{ q^x H_0(n - x) + (1 - q^x) G_0(x, n) \}    for n \ge 1

(51)  G_0(m, n) = G_0'(m, n)    for n \ge m \ge 4

(52)  G_0(m, n) = \min \{ G_0'(m, n), G_0''(m, n) \}    for n > m = 2, n > m = 3 and n = m = 3



(53) G (2,2). 

The boundary conditions are the same as for R_1, namely

(54)  H_0(0) = 0 and G_0(1, n) = H_0(n - 1)    (n = 1, 2, \ldots).

The problem of giving instructions for carrying out R_0 is more complicated since
for any G(m, n)-situation with values of q close to unity the size of the next test group
depends on both m and n. In particular, if q_0(2, n) and q_0(3, n) denote the left end
points of the interval where the appropriate mixing routine is applied for m = 2 and 3,
respectively, then it is clear from the computations that these points vary with n. In
fact, the computations show that these points are non-decreasing functions of n and
approach unity for both m = 2 and m = 3. In other words, for any fixed q, the procedure
R_0 appears to disregard mixing for all n > n_0(q).

Instructions for carrying out the procedure R_0 are given in Figure 3 for all q for
m = 1(1)n and n = 1(1)8. The sum of two numbers (say, d_0 + b_0) in Figure 3 indicates
that the appropriate mixing routine is to be performed, by mixing d_0 units from the
defective set with b_0 units from the binomial set, for the next test.

It will be useful to give an explanation (not a rigorous proof) of why mixing routines
are introduced for m = 2 and 3 and not for m \ge 4. For q asymptotically close to unity
(q \cong 1), it can be assumed that a defective set has exactly one defective unit and that a
binomial set probably has no defective units. Let G(m, n | q \cong 1) denote the limit of
G(m, n) as q tends to unity for any procedure. For any fixed n with m = 2 < n or
m = 3 < n, we wish to prove that G_0''(m, n | q \cong 1) is smaller than G_0'(m, n | q \cong 1), thus
showing that mixing is preferable for large q. For m = 2 < n, we obtain

(55)  G_0''(2, n | q \cong 1) = (1/2)1 + (1/2)2 = 3/2 < 2 = G_0'(2, n | q \cong 1).

For m = 3 < n, we obtain for the mixing routine described above

(56)  G_0''(3, n | q \cong 1) = 2 < 7/3 = (1/3)2 + (2/3)(1 + 3/2) = G_0'(3, n | q \cong 1),

using the first member of (55) to compute the last member of (56). For any fixed n
with m = 4 < n, we wish to prove that G_0''(m, n | q \cong 1) \ge G_0'(m, n | q \cong 1), thus showing
that mixing is not necessarily preferable for large q; here it is assumed that G_0''(m, n)
is defined for m \ge 4 in the best manner similar to the above mixing routines. For
m = 4 < n, we obtain for the best mixing routines

(57)  G_0''(4, n | q \cong 1) = 5/2 = G_0'(4, n | q \cong 1) = G_0(4, n | q \cong 1),

using the first member of (55) to compute the last two members of (57). Similarly,
for n > m \ge 5, we find that G_0''(m, n | q \cong 1), under the best mixing routine, is no better
than G_0'(m, n | q \cong 1) and there is no clear advantage in using any mixing routine.

It is clear from the definition of R_0 that it must be at least as efficient as R_1 for all
m, n and q. Equations (55) and (56) show that it is actually better for q sufficiently
close to unity. Numerical computation indicates that (if we start with all units in the
binomial state) R_0 and R_1 are identical for 0 < q < (1 + \sqrt{33})/8 \approx .843 and that R_0 is
better for .843 < q < 1, provided n \ge 3. In particular, it follows from the above that
under R_0 (as for R_1) all units are tested one at a time for q < q_0 = (\sqrt{5} - 1)/2 \approx .618.

It is also interesting to note from the three mixing routines in Fig. 2 that R_0 preserves
the "first come, first served" property.
14. An Alternate Method for Carrying Out Procedure R_1.

Since the procedure R_1 has been tabulated only for n = 2(1)16 for all q and for
n = 17(1)100 for q = .90, .95 and .99, it is desirable to have a method of carrying out
the procedure for any q and any n, by an algorithm which does not require recursion
formulas, and which therefore permits one to compute x's for a particular q and n,
without building an entire table. In this section, such a method will be described without
giving the proof that the procedure is equivalent to R_1. The Huffman routine [3]
and the identity

(58)  p + qp + q^2 p + \ldots + q^{n-1} p + q^n = 1

play a central role in this method. Numerical values for the terms in (58) are used
and, beyond that, the only computation required is the ordering of probabilities and
the addition of pairs of probabilities. The method will be explained with the use of a
particular example, viz., n = 10, q = .90, but the same method can be used for any
pair (n, q).

The first step is to carry out the Huffman routine on the n + 1 = 11 probabilities in
(58), i.e., the terms are ordered, the two smallest are added, the resulting set of
n = 10 probabilities is reordered, the two smallest are added, etc. The scheme can
be diagrammed as in Figure 4.

FIGURE 3

Diagram Showing the Number of Units to be Taken in any H-situation or any G-situation
for n = 1 through 8 and m \le n under Procedure R_0.

(Those G-situations, which will never arise if we start with an H-situation, are
omitted from the diagram.)

Another way of describing the way terms are added in Figure 4 is by means of
brackets as follows

(59)  \{ [((q^9 p + q^8 p) + qp) + ((q^7 p + q^6 p) + p)] + [((q^5 p + q^4 p) + (q^3 p + q^2 p)) + q^{10}] \} = 1.

Each term T_i (i = 1, 2, \ldots, 11) can be associated with a positive integer k_i which is the
number of brackets it is contained in. For the ordering of the 11 terms given in (59),
these numbers are 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 2, respectively. It is easily verified that

(60)  \sum_{i=1}^{11} 2^{-k_i} = 1

and this result is shown in [7] to hold in general for any sum. Note that the position
of the pairs of brackets corresponding to the next to the last sum (or the first major
separation) can be found by summing the 2^{-k_i} in the order given until the value 1/2 is
obtained. This takes six terms in the above example and this corresponds to the fact
that the first major separation in (59) is into the first 6 and the last 5 terms.

It is now stated (without proof) that we can insert brackets in the left member of (58)
in a (unique) manner so that

i) the order of the terms remains as it is in (58)

ii) each term T_i has the same k_i as in (59) (i = 1, 2, \ldots, 11).

To accomplish this, we first rewrite the eleven k_i in the order given by (58); we obtain
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 2. The first major separation is into the first 6 and the last 5
terms, since the first six 2^{-k_i} and the last five 2^{-k_i} each add to 1/2. The next major
separation is to break up the first 6 terms into two parts, containing 2 terms and 4
terms, and to break up the last five terms into two parts, containing 4 terms and 1
term, since \sum 2^{-k_i} = 1/4 in each of the four parts. This break up is continued and,
finally, we obtain the result

(61)  \{ [(p + qp) + ((q^2 p + q^3 p) + (q^4 p + q^5 p))] + [((q^6 p + q^7 p) + (q^8 p + q^9 p)) + q^{10}] \} = 1.

The brackets in (61) describe the method of carrying out R_1. Since the first major
separation is between q^5 p and q^6 p, we take x = 6 on the first step to determine whether
the first defective is among the first six units or whether the first six are all good.
If, for example, the first six are good then we proceed to the right in (61) and the next
separation is between q^9 p and q^{10}; this indicates that we should then test the remaining
4 units. If the first six had at least one failure, then we proceed to the left in (61)
and the next separation is between qp and q^2 p; this indicates that we should test 2 units
from the defective set of size 6, etc.
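
The worked example can be reproduced with a short Python sketch. It is offered as an illustration under stated assumptions, not as the author's algorithm: the function names are hypothetical, the Huffman routine is carried out with a binary heap, and the first group size is read off as the number of leading terms (in the order p, qp, q^2 p, ..., q^n) whose values 2^{-k_i} add to exactly 1/2.

```python
import heapq
from fractions import Fraction


def huffman_lengths(probs):
    """Number of brackets k_i containing each term, via the Huffman routine."""
    heap = [(pr, i, [i]) for i, pr in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)                    # unique tie-breaker; lists are never compared
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        for i in left + right:          # every merged term gains one bracket
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, left + right))
        tie += 1
    return lengths


def first_group_size(n, q):
    """Return (x, lengths): size of the first group test and the k_i."""
    p = 1 - q
    terms = [q ** i * p for i in range(n)] + [q ** n]
    lengths = huffman_lengths(terms)
    acc = Fraction(0)
    for count, k in enumerate(lengths, start=1):
        acc += Fraction(1, 2 ** k)
        if acc == Fraction(1, 2):       # first major separation
            return count, lengths
    return None, lengths                # no exact split at 1/2 was found


if __name__ == "__main__":
    x, ks = first_group_size(10, 0.90)
    print(ks)   # bracket counts in the order p, qp, ..., q^9 p, q^10
    print(x)    # size of the first group test
```

For n = 10 and q = .90 the bracket counts come out as 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 2 and the first separation falls after six terms, in agreement with the discussion above. Exact fractions are used for the sums of 2^{-k_i} so that the comparison with 1/2 is not affected by rounding.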

Either the procedure indicated by (61) leads to termination or it leads to the detec- 
tion of a single defective unit, at which point a new H- situation is obtained with a smal- 
ler number of units. Starting all over again, if necessary, with the same q and a 
smaller n- value, the same process is repeated until either all units are classified or 
another H- situation is obtained with a still smaller number of units. This is repeated, 
if necessary, until all units are classified. 

The proof that this is equivalent to R_1 will be published separately.

In the above construction the emphasis is on the way one proceeds from one H-situation
either to termination or to a subsequent H-situation after a single defective is removed,
whichever comes sooner; this can be regarded as a subproblem. It can be
shown on the basis of the above construction that in each subproblem the expected number
of tests required to complete the subproblem is equal to the Huffman lower bound,
i.e., the cost associated with the Huffman routine for the subproblem. Hence the procedure
R_1 is optimal within each subproblem, although, if q is close to unity, N is
finite and the units are all identified, it is not optimal for the problem as a whole.
15. The Information Procedure. 

Another procedure, R_2, is based on choosing that size x for the next group test
which maximizes the entropy (or information) associated with the next group test. We
shall refer to this as the "information procedure".

For an H-situation, the next group test has two outcomes with probabilities q^x and
1 - q^x and the associated entropy (measured in bits) is given by

(62)  I_H(x;q) = -[q^x \log_2 q^x + (1 - q^x) \log_2 (1 - q^x)].

To maximize (62), we use the fact that (p \log 1/p + q \log 1/q) attains its maximum at
p = q = 1/2. Hence x is taken to be the positive integer such that q^x is closer to 1/2
than is q^{x-1} or q^{x+1}. For any particular integer x, there is an interval of q-values for
which the same x is chosen. The left and right endpoints, respectively, of this interval
are the roots of the equations

(63)  (1/2)(q^x + q^{x-1}) = 1/2 and (1/2)(q^{x+1} + q^x) = 1/2.

Hence the solution is known if we have a table of dividing points (i.e., q-values) which
separate x from x + 1 for each positive integer x; such a table is given in Table VII of
[4].

The above is based on the fact that 1 - q^x - q^{x+1} = 0 has a unique root in the unit interval
for every positive integer x and that this root is a strictly increasing function of x. It is
interesting to note that the solution above does not depend on the size n of the binomial set.
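
The choice of x in an H-situation can be sketched in Python as follows. The function names are hypothetical; the loop uses the dividing-point condition that x + 1 is preferred to x exactly when q^x + q^{x+1} > 1, and the entropy function is included only so that the choice can be compared with a direct maximization.

```python
import math


def entropy_H(x, q):
    """Entropy in bits of a group test on x units drawn from a binomial set."""
    u = q ** x
    if u <= 0.0 or u >= 1.0:
        return 0.0
    return -(u * math.log2(u) + (1 - u) * math.log2(1 - u))


def info_group_size_H(q):
    """Integer x for which q**x is closest to 1/2 (requires 0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    x = 1
    while q ** x + q ** (x + 1) > 1:    # x + 1 is still the better choice
        x += 1
    return x


if __name__ == "__main__":
    for q in (0.60, 0.90, 0.95, 0.99):
        x = info_group_size_H(q)
        print(q, x, round(entropy_H(x, q), 5))
```

For q = .90, .95 and .99 the sketch gives x = 7, 14 and 69, which agree with the entries for R_{22} in Table I; for q below about .618 it gives x = 1.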

For a G- situation, the entropy associated with the next group test, based on x units 
taken only from the defective set, is 

<> 

To maximize (64), we choose x so that $q^x$ is closer to $(1/2)(1 + q^m)$ than $q^{x-1}$ or $q^{x+1}$.

For a fixed integer x, there is an interval (which may be empty) of q-values for which
the same x is chosen; the dividing point between x and x + 1 is the unique root in the
interior of the unit interval of

(65)  $1 - q^x - q^{x+1} + q^m = 0$

whenever (65) has such a root. If the root q = 1 is removed, then (65) becomes

(66)  $1 + q + q^2 + \dots + q^{x-1} - (q^{x+1} + q^{x+2} + \dots + q^{m-1}) = 0$,

which clearly has at most one positive root. If the root is not present for some pair
(x, m) then x + 1 will never be used for that m under procedure $R_2$. Since the left member
of (65) is a strictly increasing function of x, then for any fixed q with 0 < q < 1,
m ≥ 2 and x ≥ (m - 1)/2

(67)  $1 - q^x - q^{x+1} + q^m \ge (1 - q^{(m-1)/2})(1 - q^{(m+1)/2}) > 0$.

It follows that the largest x for which the root is present is such that x + 1 < (m + 1)/2
or x + 1 ≤ m/2 and hence, under procedure $R_2$, we never take a test group of size
greater than m/2. It is interesting to note that the solution in this case depends on m
but not on n, as in $R_1$. Under procedure $R_2$, the dividing point between x = 1 and x = 2
is the same for each m as under the procedure $R_1$ (see Table IIIB of [4]).
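
The existence condition for the root can be checked numerically. The sketch below is an illustration only, with a hypothetical function name: it evaluates the left member of (66), which equals x - (m - 1 - x) at q = 1, and returns the interior root by bisection when that value is negative and None otherwise.

```python
def g_dividing_point(x, m, tol=1e-12):
    """Root in (0, 1) of equation (66), or None when no interior root exists."""
    def g(q):
        head = sum(q**i for i in range(0, x))       # 1 + q + ... + q**(x-1)
        tail = sum(q**i for i in range(x + 1, m))   # q**(x+1) + ... + q**(m-1)
        return head - tail

    # g(0) = 1 > 0. Since (66) has at most one positive root, an interior
    # root exists exactly when g(1) = 2*x - m + 1 is negative.
    if 2 * x - m + 1 >= 0:
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The condition 2x - m + 1 < 0 is the same as x + 1 ≤ m/2 for integers, which is the bound on the test-group size stated above.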

Let $F_2(m)$, $F_2^*(m)$, $G_2(m, n)$, $G_2^*(m, n)$ and $H_2(n)$ be defined for procedure $R_2$ exactly
as they were defined for procedure $R_1$. Then

(68)  $F_2^*(m) = \sum_{i=1}^{m} q^{i-1} + q^x F_2^*(m-x) + F_2^*(x)$   (m ≥ 2)

where x is given by the above discussion or by Table VII (G-situation) of [4]. Equations
(9) and (12) also hold with all subscripts 'one' replaced by 'two'. Finally,

(69)  $H_2(n) = 1 + q^x H_2(n-x) + p\,G_2^*(x, n)$
             $= 1 + q^x H_2(n-x) + p F_2^*(x) + p \sum_{i=1}^{x} q^{i-1} H_2(n-i)$

where x is given by the above discussion or by Table VII (H-situation) of [4]. The
boundary conditions state that $F_2^*(1) = H_2(0) = 0$ for all q. Exact polynomial expressions
for $F_2^*(m)$ and $H_2(n)$ are given in Tables VIII and VI, respectively, in [4]. It
should be noted that both $F_2^*(m)$ and $H_2(n)$ may have points of discontinuity; at such
values of q the polynomial which gives the smaller expectation (and the corresponding
x-value) should be used. It should be observed in the numerical comparisons of Table
IIIA of [4] that the procedure $R_2$ compares quite favorably with the procedure $R_1$
for all values of q. Moreover, the fact that the dividing points are easier to compute
makes it easier to apply, since the dividing points for $R_1$ are known only up to n = 16.
It is also interesting to note that the limiting expressions in Table IIIA of [4] as
n → ∞ and in Table IIIB of [4] as m → ∞ are the same as the second equation
in (63).
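
The recursions (68) and (69) can be evaluated directly for small n. The sketch below rests on assumptions that are not spelled out in the text: x is chosen by maximizing the entropy of the next test outright (rather than by table lookup), x is restricted to 1, ..., n in an H-situation and to 1, ..., m - 1 in a G-situation, and the probability that x units drawn from a defective set of size m are all good is taken as (q^x - q^m)/(1 - q^m). All function names are illustrative.

```python
import math
from functools import lru_cache


def entropy(p):
    """Entropy (in bits) of a test whose outcome has probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_tests(n, q):
    """H(n) from (69), with F*(m) from (68), for 0 < q < 1."""
    p = 1.0 - q

    def x_binomial(k):
        # H-situation: most informative group size among 1..k
        return max(range(1, k + 1), key=lambda x: entropy(q**x))

    def x_defective(m):
        # G-situation: most informative group size among 1..m-1
        return max(
            range(1, m),
            key=lambda x: entropy((q**x - q**m) / (1.0 - q**m)),
        )

    @lru_cache(maxsize=None)
    def f_star(m):
        if m == 1:
            return 0.0  # boundary condition F*(1) = 0
        x = x_defective(m)
        geometric = sum(q ** (i - 1) for i in range(1, m + 1))
        return geometric + q**x * f_star(m - x) + f_star(x)

    @lru_cache(maxsize=None)
    def h(k):
        if k == 0:
            return 0.0  # boundary condition H(0) = 0
        x = x_binomial(k)
        tail = sum(q ** (i - 1) * h(k - i) for i in range(1, x + 1))
        return 1.0 + q**x * h(k - x) + p * f_star(x) + p * tail

    return h(n)
```

Under these assumptions, q = .90 gives 1.290, 1.661 and 2.051 for n = 2, 3 and 4. At a dividing point the maximizer is not unique, and the sketch simply keeps the smallest x rather than comparing the two polynomials as the text prescribes.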

It is possible to devise a sequence of procedures $R_2^{(j)}$ such that $R_2^{(1)} = R_2$ and
$R_2^{(j)} = R_1$ for j sufficiently large. Under the procedure $R_2^{(j)}$, we choose for the size
of the next group test that positive integer x which maximizes the expected information
to be obtained from the next j tests in the following sense: it maximizes the ratio of
the entropy associated with the next set of (at most) j tests to the expected number of
tests, given that the number of tests will not exceed j. For all these procedures, in a
G-situation, the next test group is taken only from the defective set.

In the special case when there is no possibility of stopping before j tests, then we can
disregard the denominator (which is a constant j) and simply maximize the information.
Since this is always the case for j = 1, then $R_2^{(1)} = R_2$.

Let $M_2^{(j)}(m, n)$ be defined as the maximum number of tests needed in a G(m, n)-
situation under procedure $R_2^{(j)}$. Then the associated entropy is independent of x. (It
is given by the right-hand members of (41) and (44) in the H- and G-situations, respectively.)
Hence for j ≥ $M_2^{(j)}(m, n)$ the numerator above can be disregarded (i.e.,
treated as constant) and the problem is to choose that x which minimizes the denominator
or the unrestricted expected number of group tests, i.e., $R_2^{(j)} = R_1$ for
j ≥ $M_2^{(j)}(m, n)$. The value of $M_2^{(j)}(m, n)$, for all j ≥ 1, appears to be the same as
$M_1(m, n)$ defined in (7) for procedure $R_1$, but this has not been rigorously shown. In
all these procedures it is the number of tests required when q is close to unity and the
units to be tested happen to be all defective.

For any H(n)-situation with n ≥ 4 and j ≥ 2, these procedures appear to eliminate the
strategy of taking x = n - 1. For example, if n = 4, j = 2 and q > .618, then we wish to
compare x = 2 and x = 3. For j = 1, the dividing point between x = 2 and x = 3 is
q = .755. Since neither x = 2 nor x = 3 results in termination after one test, we disregard
the denominator and compare (for the starting values x = 2 and x = 3) the maximum
entropies associated with the next two group tests. The results show that x = 2
is preferable to x = 3 for all q > .618. The same result holds for all j > 2. Then we
find that for j = 2 the dividing point between x = 2 and x = 4 is .789. For $R_1$ the corresponding
dividing point is .786.

For all j, we state without proof that in both G- and H-situations under procedure
$R_2^{(j)}$ the units are all tested one at a time for q < $(1/2)(\sqrt{5} - 1)$ = .618.

This sequence of procedures $R_2^{(j)}$ explains why $R_2^{(1)} = R_2$ is not optimal (it takes into
account only the very next test) and how its efficiency can be improved by increasing j.
The most efficient procedure in this sequence is $R_1$.

The author wishes to acknowledge that some of the material in Sections 12 and 14 
arose from conversations with Professor Warren Hirsch, New York University, and 
A. Ross Eckler, Bell Telephone Laboratories. Thanks are also due to Miss Dorothy 
Kriechbaum, Miss Phyllis Groll, and Miss Ann Graztano, Bell Telephone Laboratories, 
for their help with the computations in this paper.

TABLE

Comparison of Expected Number of Tests for $R_1$ and $R_0$ and Lower Bounds for Any
Procedure Starting with a Binomial Set of Finite Size, n. (The three entries in each
cell correspond to q = .90, .95 and .99, respectively.)



                                      Huffman     Information
                                    Lower Bound   Lower Bound
   n      q         H(n; q, R)        HB(n, q)     ILB(n, q)

   2    .90     1.290     1.290       1.290         0.938
        .95     1.148     1.148       1.148         0.573
        .99     1.030     1.030       1.030         0.162

   3    .90     1.661     1.627       1.598         1.407
        .95     1.340     1.307       1.300         0.859
        .99     1.070     1.060       1.060         0.242

   4    .90     2.051     2.019       1.973         1.876
        .95     1.538     1.505       1.469         1.146
        .99     1.110     1.100       1.091         0.323

   5    .90     2.490     2.449       2.401         2.345
        .95     1.771     1.714       1.681         1.432
        .99     1.159     1.141       1.131         0.404

   6    .90     2.943     2.911       2.825         2.814
        .95     2.009     1.956       1.897         1.718
        .99     1.208     1.183       1.172         0.485

   7    .90     3.414     3.381       3.320         3.283
        .95     2.252     2.191       2.126         2.005
        .99     1.258     1.232       1.213         0.566

   8    .90     3.904     3.867       3.806         3.752
        .95     2.499     2.439       2.390         2.291
        .99     1.308     1.282       1.257         0.646

  10    .90     4.872     4.834       4.767         4.690
        .95     3.039     2.977       2.920         2.864
        .99     1.425     1.384       1.362         0.808

  12    .90     5.790     5.755       5.640         5.628
        .95     3.594     3.533       3.449         3.437
        .99     1.543     1.492       1.467         0.969

  20    .90     9.572     9.536        n.c.         9.380
        .95     5.940     5.872        n.c.         5.728
        .99     2.051     1.977        n.c.         1.616

  40    .90    19.024    18.988        n.c.        18.760
        .95    11.671    11.607        n.c.        11.456
        .99     3.478     3.384        n.c.         3.232

  60    .90    28.475    28.439        n.c.        28.139
        .95    17.438    17.372        n.c.        17.184
        .99     5.026     4.936        n.c.         4.847

  80    .90    37.925    37.889        n.c.        37.519
        .95    23.197    23.132        n.c.        22.912
        .99     6.647     6.557        n.c.         6.463

 100    .90    47.375    47.339        n.c.        46.899
        .95    28.959    28.894        n.c.        28.640
        .99     8.320     8.227        n.c.         8.079

n.c. entries were not computed.

SOME OPEN PROBLEMS IN THE FOUNDATIONS OF SUBJECTIVE PROBABILITY 1

Patrick Suppes 

Not only am I the only speaker who is a philosopher, but I am probably the only per- 
son attending this conference who is a philosopher; thus I should be expected to give 
you some words of wisdom. But I really do not have any such words to say. More par- 
ticularly, I do not want to offer any general defense of subjective probability, or the 
meaning of subjective probability. I do not mean to admit by this that I am unwilling to 
offer such a defense. It is just that I do not want to rehash an old story. This morning
I am going to talk about more limited problems than a general defense of the meaning 
and possible applications of notions of subjective probability. Secondly, in talking 
about problems of subjective probability, I will talk about some problems which inter- 
est me. I will not maintain that these problems are the most important, or the most 
interesting to everyone - they are problems which have interested me. Thirdly, I will 
be talking in the framework, particularly in the first part of the talk, that Savage in- 
troduced in his book, Foundations of Statistics. 

The kind of model introduced in that book is as follows: there is a set S of states of 
nature, a set C of consequences, and a set D of decisions or acts which are functions 
mapping S into C. The decision-maker's problem is to choose from the decisions or
acts that are available one which is in some sense optimal. The analysis which Savage's
book leads to is the standard MEU behavioral pattern (maximization of expected
utility). Savage introduces seven axioms in terms of an ordering relation ≤ on acts
or decisions. For example, Axiom 1 asserts that this relation is transitive and connected.
By connected I mean we can weakly choose between any two acts. Naturally
though, this axiom does not take us very far. The upshot of the six additional axioms 
is to yield the MEU result; namely, that if the postulates, in terms of this relation on 
acts, are satisfied, then we can show that in choosing an act from the set available, a 



1 The research on which this expository paper is based has been supported by the
Group Psychology Branch of the Office of Naval Research.

person is maximizing expected utility. We mean by this that the person has a utility
function on the set of consequences and a subjective probability distribution on the set 
of states, and the expectancy is with respect to this subjective probability distribution 
on the set of states. 

This kind of maximization of expected utility behavior is not a notion which in any 
sense originates with Savage; it is very old - in fact it goes back to James Bernoulli 
in the 18th century. Within this kind of framework there are two major classes of problems
that I would like to discuss. The first class of problems, in a certain definite
sense, is oriented toward normative behavior, i.e., telling a person how he should
behave. The second class of problems is oriented toward a descriptive application.
To what extent can we use notions of utility and subjective probability to discuss or to
analyze the actual behavior of people? Under the normative heading I will be particularly
interested in what I will call problems of axiomatizability and definability, and
under the second general heading in what I call behavioristic problems. So let me now 
address myself to problems of axiomatizability and definability. I want to discuss cer- 
tain axiomatizability problems that we can raise and which seem to be interesting and 
somewhat difficult to solve. In discussing these axiomatizability problems there will 
be some notions perhaps not completely familiar. I will try to indicate intuitively the 
character of the results, even if I do not explicate all the technical details. 

1. Constant acts. A problem which arises immediately in the Savage framework is 
that of the constant functions or constant acts. By a constant act I mean one that yields 
the same consequence whatever the state of nature. In more formal terms, a constant 
act is a function in the set D whose value is the same for all arguments; that is, for 
all states of nature. Savage's analysis requires that D include the set of all constant 
acts. An earlier unpublished paper of Herman Rubin's, which assumes some quantita- 
tive postulates but is concerned with deriving the existence of a Bayesian distribution 
on the states of nature, also requires such acts. My own set of axioms [1], analogous
to Savage's but more closely related to the approach (1926) of Frank Ramsey to
these problems, demands inclusion of the constant acts. 

I know of no analysis which does not require these acts, and yet I want to show by 
analyzing an example of Savage's just how difficult it is to interpret them. Suppose a
man is mixing a six-egg omelet and has put five eggs in a bowl; the problem is what to
do with the sixth egg. (For some reason he has a suspicion it may be rotten.) For the
moment, we will reduce the problem to two acts: Act I, put egg in a separate bowl;
Act II, put egg directly in with other five. The states of nature are $S_1$, the egg is rotten,
and $S_2$, the egg is fresh. There are two possibilities. If he puts the egg in a separate
bowl and the egg is rotten then he can replace it. He does not ruin the omelet. If
he puts the egg in with the other five and the egg is rotten, he ruins all six. I will 
assume it is very difficult to separate out the rotten egg when it is mixed in with five 
good ones. On the other hand, it is troublesome and time-consuming to put the egg 
in a separate bowl. If the man strongly believes the egg is fresh, he is very likely to 
put it directly into the bowl containing the five other eggs. The constant acts now enter 
in the following way. In order to prove that the axioms of behavior yield an MEU re- 
sult, it is necessary (but not sufficient in this case) to extend our set of acts to include 
the constant acts. In particular, we need to have an act which, even if the egg is fresh, 
leads to a consequence of ruining the omelet. In other words, totally unrealizable acts 
are required in order to derive the MEU result. We can certainly, introspectively in 
some general way, understand what these acts mean. We cannot realize them. To my 
mind, it is a severe weakness of a theory which claims to be behavioristic to have such 
acts inextricably included in its formal setup; they hark back all too much to the ver- 
balistic tradition which Savage has so admirably criticized. It is, of course, not play- 
ing the game to adopt some ad hoc device like that of a random mechanism whose 
workings do not affect and are unaffected by goings on in the rest of the world. The 
assumption of such a mechanism is a patent deus ex machina and nullifies one of the 
primary aims of the Savage kind of analysis; namely, to extend the theory of rational 
behavior to areas of action where it is unnatural to think in terms of random mecha- 
nisms. 
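
The role of the constant acts in the omelet example can be made concrete with a small sketch. Everything numerical here is hypothetical: the consequence labels, the utilities and the probabilities are illustrative choices, not values from Savage or from the discussion above. An act is represented as a mapping from states of nature to consequences, and a constant act is one whose value does not depend on the state.

```python
# States of nature and acts from the omelet example; an act maps each
# state to a consequence. The consequence labels are illustrative.
STATES = ("rotten", "fresh")

ACT_I = {"rotten": "omelet saved, extra bowl", "fresh": "omelet, extra bowl"}
ACT_II = {"rotten": "omelet ruined", "fresh": "omelet, no extra bowl"}

# A constant act yields the same consequence whatever the state. This one
# ruins the omelet even when the egg is fresh, so no real action realizes it.
CONSTANT_RUIN = {state: "omelet ruined" for state in STATES}

# Hypothetical utilities on consequences (not taken from the text).
UTILITY = {
    "omelet, no extra bowl": 10,
    "omelet, extra bowl": 8,
    "omelet saved, extra bowl": 7,
    "omelet ruined": 0,
}


def expected_utility(act, prob, utility=UTILITY):
    """Expected utility of an act under a subjective distribution on states."""
    return sum(prob[s] * utility[act[s]] for s in STATES)


def best_act(acts, prob):
    """Name of the act with the largest expected utility."""
    return max(acts, key=lambda name: expected_utility(acts[name], prob))


def is_constant(act):
    """True when the act has the same consequence in every state."""
    return len({act[s] for s in STATES}) == 1
```

With these made-up utilities, a man who gives the egg a subjective probability of .1 of being rotten prefers Act II, and one who gives it .5 prefers Act I. The constant act is perfectly well defined as a function, which is all the formal theory asks of it; the difficulty raised above is that nothing the man can do corresponds to it.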

2. Theory of pure rationality. The axioms of the various systems of rational behavior
which have been proposed by Ramsey, de Finetti, Savage, and others, including
myself, may be divided into two classes. In the first class go those which may be
thought of as holding anywhere and anytime. These I call pure axioms of rationality.
An example of a pure axiom is the postulate that the preference relation on the set of
acts is transitive. In the second class belong those which postulate some special structural
property of the environment and possibly of the decision maker. These I call
structural axioms. The main structural axiom in Savage's setup is, roughly speaking,
that the decision-maker can partition the set of states of nature as fine as he pleases
in terms of probability. The result of this axiom is that there must be, in any model
satisfying Savage's axioms, an infinity of states of nature, and given any probability ε
no matter how small there is a set of states which has a probability no greater than ε.
Such a requirement has nothing in itself to do with the concept of pure rationality, that
is, with the concept of making a rational decision. I consider it a structural imposition,
a limitation on the range of applicability of the theory.

Savage's axiom is, of course, not the only kind of structural assumption which can
be made. In my Berkeley Symposium paper, the number of states of nature is arbitrary
and I depended on a different kind of structural axiom; namely, that between any
two consequences the decision-maker can find another which is equally spaced in utility
between them. This axiom implies that, except in the trivial case of all consequences
being equally prized, there must be an infinity of consequences. In another paper [2],
Donald Davidson and I used the structural assumption that there are only a
finite number of consequences which are equally spaced in utility.

Two things about these structural axioms should be clear. In the first place, al- 
though I have used quantitative or semi- quantitative language in formulating them, all 
of them may be formulated in terms of very primitive and qualitative concepts. Se- 
condly, in all systems of axioms formulated within the Savage kind of framework with 
which I am familiar, such axioms are necessary to prove the MEU kind of result. And 
now I want to give some relatively fundamental reasons for this necessity. 

To begin with it will be desirable to have a more exact definition of the notion of a 
pure axiom of rationality. I say that an axiom of behavior is a pure axiom if and only
if, whenever it is satisfied in a model M, it is satisfied in any submodel of M. Consider,
for instance, the axiom that the preference relation ≤ on the set D of acts is transitive.
Any ordered couple M = <A, R> is a possible realization of this axiom if A is
a non-empty set and R is a binary relation on A. A possible realization M is a model
of the axiom if the relation R is transitive on A. A model M' = <A', R'> is a submodel
of the model M if A' is a subset of A and R' is the relation R restricted to the
set A'. It is easily verified that any submodel of a model of the transitivity axiom is
also a model of the axiom, and consequently this axiom is pure. It may also easily be
shown that the connectivity axiom for the preference relation is also a pure axiom.
Suppose now we consider an axiom which says three things: (i) the preference relation
is transitive on the set of acts; (ii) it is also connected on this set; and (iii) there is
one act which is (weakly) preferred to all others, that is, there is an act $d_1$ such that
for all acts d, d ≤ $d_1$. Now this axiom is pure if we restrict ourselves to finite models
because any finite model having properties (i) and (ii) will also have (iii). However,
if we permit infinite models, then the axiom is no longer pure, because an infinite
set which has a greatest element with respect to an ordering relation may have infinite
subsets which do not have such an element. For example, the set of all rational
numbers x such that 0 < x ≤ 1 has 1 as its greatest element with respect to the natural
ordering ≤, but the subset of numbers such that 0 < x < 1 has no such greatest element.
This axiom may suggest that structural axioms are always existential in character,
but this is not always the case; for instance, the one Davidson and I used [2] is
not existential in form.
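
The notion of a submodel lends itself to a direct check on small finite models. The sketch below is an illustration with hypothetical function names: it restricts a relation to every non-empty subset of a four-element set and confirms that transitivity, and for a finite connected ordering the existence of a greatest element, survive the restriction. The last function gives the witness behind the rational-number example: for any rational x with 0 < x < 1 it returns a larger rational still below 1.

```python
from fractions import Fraction
from itertools import combinations


def restrict(R, subset):
    """The relation R restricted to a subset of its field."""
    return {(a, b) for (a, b) in R if a in subset and b in subset}


def is_transitive(R):
    """True when a R b and b R d always imply a R d."""
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)


def has_greatest(A, R):
    """True when some element g of A satisfies d R g for every d in A."""
    return any(all((d, g) in R for d in A) for g in A)


def submodels(A):
    """All non-empty subsets of the finite set A."""
    items = sorted(A)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield set(subset)


def larger_in_open_interval(x):
    """For rational 0 < x < 1, a rational strictly between x and 1."""
    return (x + 1) / 2
```

A finite check of this kind cannot exhibit the failure of purity itself, since that failure arises only for infinite models; the function `larger_in_open_interval` merely shows why no candidate in the open interval can be greatest.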

The question I now pose is this. What are the possibilities of axiomatizing the theory
of pure rationality? In the first place, it is reasonable to restrict ourselves to recursive
axiomatizations. A recursive axiomatization of a subject may consist of an infinite
list of axioms, but there is a mechanical method for deciding whether or not a
statement is an axiom. A simple example of a non-recursive axiomatization may be
given for arithmetic, namely the single sentence "A statement S of arithmetic is an
axiom if and only if it is true." This axiomatization is non-recursive because it follows
from fundamental results of Gödel and Tarski that there is no mechanical method
for deciding whether or not a sentence of arithmetic is true.

Secondly, I shall restrict consideration to what are called in logic first-order axioms;
that is, we shall permit the variables which occur in the axioms to take as
values only the elements of the set D of acts. This is a strong restriction, for it prohibits,
for example, any Archimedean axiom which uses an integer-valued variable.
The reason for this restriction is that I want to discuss some negative results of a
metamathematical or logical character. The difficulties of obtaining any general results
on problems of axiomatizability when the axioms are not first-order are considerable.
Having imposed the restriction of first-order axioms, it will be necessary to
consider only finite models, for it is well known that if a set of first-order axioms has
an infinite model it has an infinite number of models of different infinite cardinality.
Consequently it is impossible to give for infinite models first-order axioms on the basis
of which the existence of numerical utility or subjective probability functions may
be established.

Thirdly, for purposes of simplicity it will be desirable to deal with a situation which
permits only two states of nature $s_1$ and $s_2$ with equal subjective probabilities. Thus
$\sigma(s_1) = \sigma(s_2)$, where $\sigma(s_i)$ is the numerical subjective probability of state $s_i$. And
in terms of expected utility we may then write for $d_1$, $d_2$ in D: $d_1 \le d_2$ if and only if

(1)  $\sigma(s_1)u(d_1(s_1)) + \sigma(s_2)u(d_1(s_2)) \le \sigma(s_1)u(d_2(s_1)) + \sigma(s_2)u(d_2(s_2))$,

where u is the numerical utility function of the set C of consequences. Now since
$\sigma(s_1) = \sigma(s_2)$, we have equivalent to (1)

(2)  $u(d_1(s_1)) + u(d_1(s_2)) \le u(d_2(s_1)) + u(d_2(s_2))$,

which in turn is equivalent to

(3)  $u(d_1(s_1)) - u(d_2(s_1)) \le u(d_2(s_2)) - u(d_1(s_2))$.

Whence the theory of pure rationality for this situation of two states of nature with
equal probability reduces to axiomatizing the quaternary relation R on the set C of
consequences such that there is a numerical function u on C with the property that for
every x, y, z, and w in C, xy R zw if and only if

(4)  $u(x) - u(y) \le u(z) - u(w)$.

The transformation from the relation ≤ on D to the relation R on C is made for technical
purposes. Several years ago I thought it would not be a difficult matter to axiomatize
R in terms of a finite list of first-order sentences so as to satisfy (4). The
problem has not only proved difficult, but in fact Dana Scott and I have shown that it
cannot be axiomatized by a finite list of first-order axioms none of which is existential
in character [3]. Intuitively it seems that existential sentences cannot offer any real
help when it is required that the axioms be closed under submodels, but we have been
unable to back up this intuition with a formal proof. So even for this simple case, the
problem of finite axiomatization is not settled.
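
The passage from preferences between acts to the quaternary relation R can be checked mechanically on a small example. The following sketch is illustrative only: the three consequences and their utilities are made up, an act is represented as a pair (consequence in state 1, consequence in state 2), and the function names are hypothetical. It verifies that, for two equally probable states, the ordering of acts given by (2) agrees with the relation R of (4) applied as in (3).

```python
from itertools import product


def weakly_prefers(d1, d2, u):
    """Inequality (2): d1 <= d2 for two equally probable states."""
    return u[d1[0]] + u[d1[1]] <= u[d2[0]] + u[d2[1]]


def R(x, y, z, w, u):
    """Inequality (4): xy R zw iff u(x) - u(y) <= u(z) - u(w)."""
    return u[x] - u[y] <= u[z] - u[w]


def reduction_holds(consequences, u):
    """Check that (2) and (3) agree for every pair of acts."""
    acts = list(product(consequences, repeat=2))  # (value at s1, value at s2)
    return all(
        weakly_prefers(d1, d2, u) == R(d1[0], d2[0], d2[1], d1[1], u)
        for d1 in acts
        for d2 in acts
    )
```

The check runs in the easy direction, from a given u to R. The axiomatization problem discussed above runs the other way: to state conditions on R alone that guarantee some such u exists.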

It is possible to give a recursive axiomatization of the relation R (for finite models)
by enumeration of what are technically called the isomorphism types of R. We start
with sets of cardinality one and list the single isomorphism type, and proceed in this
way for each finite cardinal n, listing the types in some fixed order. The difficulty, of
course, is that this kind of recursive axiomatization is intuitively completely uninformative.
This is by no means always the case with recursive axiomatizations of a theory,
as the standard axioms for elementary number theory or those for Zermelo set
theory adequately testify. The negative proof given by Scott and me [3] depended upon
showing that an infinite but recursive list of axioms which permitted "addition" of intervals
is in a certain sense necessary, and at one time we thought a reasonably satisfactory
recursive axiomatization could be given which used this addition schema and a
finite number of additional axioms. Unfortunately Robert McNaughton produced a
counterexample to this system of axioms. His counterexample consists of a set of
twenty-two elements; it satisfies the addition schema but does not permit a numerical
representation of the kind characterized by (4). It seems that the problem of finding a
reasonably appealing recursive axiomatization is difficult.

A fortiori these problems of axiomatization are unsolved for models which permit 
more states of nature. 

3. Behavioristic foundations of subjective probability and utility. From a psycholo- 
gical standpoint the most undesirable thing about the MEU result within the Savage kind 
of framework is its static character. There is no attempt to explain how an organism 
comes to have subjective degrees of beliefs about possible states of nature, or evalua- 
tions of the relative desirability of different possible consequences. There is no theory 
as to how the environment interacts with the individual. 

I have recently derived from the general assumptions of stimulus learning theory a
utility for some simple choice situations [4]. I want briefly to describe these results
and then to indicate some of the open problems. Stimulus sampling learning theory was
first given a quantitative formulation in 1950 by the psychologist W. K. Estes, and has
since been developed by a number of investigators. The basic ideas run as follows.
The organism is presented with a sequence of trials on each of which he makes a response
that is one of several possible choices. In any particular setup it is assumed
that there is a set of stimuli from which the organism draws a sample at the beginning
of each trial. It is assumed that on each trial each stimulus is conditioned to exactly
one response. The probability of making a given response on any trial is postulated to
be simply the proportion of sampled stimuli which are conditioned to that response.
Learning takes place by the following mechanism. At the end of a trial a reinforcing
event occurs which identifies that one of the possible responses which was correct.
The sampled stimuli become conditioned to this response, and the organism begins another
trial in a new state of conditioning.
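
The mechanism just described can be simulated in a few lines. The sketch below fills in details the account leaves open, and each is an assumption: there are two responses, each stimulus enters the sample independently with probability theta, a response is chosen at random when the sample happens to be empty, and the reinforcing event is supplied by a caller-provided function. The names and default values are illustrative.

```python
import random


def simulate(n_stimuli=20, theta=0.3, n_trials=200,
             reinforce=lambda rng: 0, seed=0):
    """Run a two-response stimulus sampling model; return final state and history."""
    rng = random.Random(seed)
    # conditioning[k] is the response (0 or 1) to which stimulus k is conditioned
    conditioning = [rng.randrange(2) for _ in range(n_stimuli)]
    history = []
    for _ in range(n_trials):
        # Draw the sample for this trial (assumed independent sampling).
        sample = [k for k in range(n_stimuli) if rng.random() < theta]
        if sample:
            # Response probability = proportion of sampled stimuli
            # conditioned to that response.
            p0 = sum(conditioning[k] == 0 for k in sample) / len(sample)
        else:
            p0 = 0.5  # assumption: guess at random when nothing is sampled
        response = 0 if rng.random() < p0 else 1
        correct = reinforce(rng)  # reinforcing event identifies the correct response
        for k in sample:
            conditioning[k] = correct  # sampled stimuli become conditioned to it
        history.append((response, correct))
    return conditioning, history
```

With the default reinforcement, which always designates response 0 as correct, every stimulus that is ever sampled ends up conditioned to response 0. Passing a random `reinforce` function gives a probabilistic schedule of the kind met in the slot-machine situations discussed next.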

Naturally this account of stimulus sampling theory is a highly simplified one, and 
yet it should be clear in what sense this theory is dynamic rather than static, and thus
provides a theoretical analysis of how the organism is interacting with its environment. 

The kind of utility results obtained from this theory thus far are easily sketched. 
Suppose a person is on each trial presented with one of several pairs of slot machines. 
That is, on each trial he chooses which of two slot machines to play, but the pairs 
presented vary from trial to trial. (When there are exactly two slot machines, this is
the familiar two-armed bandit problem.) Let there be N slot machines with $v_i$ the
probability of payoff of the i-th machine (the probability $v_i$ is not known to the person).
Then the following utility function satisfying a requirement like (4) may be derived 
from stimulus sampling theory: 

u(i) - log , ( *_ ff ) . 

where $c_i$ is the learning parameter associated with the i-th machine.

It is still far from clear how this kind of result may be extended to more complica- 
ted behavioral situations. Moreover, it is not yet clear how both subjective probability 
and utility functions may be derived from stimulus sampling theory even for very 
simple situations. Positive solution of these problems would provide yet another step- 
ping stone toward the construction of a psychologically sophisticated theory of actual 
inductive behavior.

Statistical Decision Theory in Engineering
Lionel Weiss 

1. Introduction. First we make a quick survey of the two broad fields of "classical 
statistics": estimation and testing a hypothesis. In a problem of estimation, we are 
supposed to construct an estimate of an unknown parameter by using observations on 
a random variable. This estimate is either a single number (a point estimate) or a 
whole interval (a confidence interval). In a problem of testing a hypothesis, we are to 
decide whether or not the unknown parameter has some stated property (the hypothesis 
is that the parameter has this property). 

One great advantage of statistical decision theory is that it handles both the problem
of testing a hypothesis and the problem of estimation as special cases of a much more
general problem. The general problem handled by statistical decision theory can be
briefly described as follows. We have to choose one decision out of a given set of
possible decisions, after observing the jointly distributed random variables
$X_1, \dots, X_m$, whose joint probability distribution is not completely known, but is known
to be one of a given set of possible joint distributions. After the decision is chosen, a
loss is incurred which depends on the particular decision chosen and on which particular
joint distribution is the actual distribution of $X_1, \dots, X_m$. (In some problems, the
loss may also depend on the observed values of $X_1, \dots, X_m$.)

The problem of estimation is a special case of a statistical decision problem, where
the possible distributions of $X_1, \dots, X_m$ are given by the variation of a parameter, and
the possible decisions are the possible values of the parameter. Thus, if the decision
chosen is denoted by D, and the true value of the parameter is denoted by $\theta$, the loss
might be $(D - \theta)^2$, so our loss increases as our estimate gets farther from the true
value.

The problem of testing a hypothesis is a special case of a statistical decision problem
where the possible distributions are broken into two groups, group I containing those
distributions which satisfy the hypothesis, group II containing the distributions which
do not satisfy the hypothesis, and there are two possible decisions, one decision being
to state that the distribution is in group I (that is, to accept the hypothesis), the other
decision being to state that the distribution is in group II (that is, to reject the hypothesis).
If the right decision is chosen, the loss is zero, otherwise it is some positive
number.
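
Both special cases can be put in a single computational form: a decision rule maps the observations to a decision, and the expected loss of the rule is computed under each possible distribution. The sketch below is an illustration under assumptions not made in the text: the observations are taken to be independent with values 0 or 1 and success probability theta, group I is taken to be theta ≤ 1/2, and the two decision rules are simple made-up examples.

```python
from itertools import product


def squared_error_loss(decision, theta):
    """Estimation: loss grows as the estimate moves away from theta."""
    return (decision - theta) ** 2


def zero_one_loss(decision, true_group, penalty=1.0):
    """Testing: zero loss for the right decision, a positive number otherwise."""
    return 0.0 if decision == true_group else penalty


def hypothesis_loss(decision, theta):
    """Assumed hypothesis: group I is theta <= 1/2, group II is theta > 1/2."""
    return zero_one_loss(decision, "I" if theta <= 0.5 else "II")


def risk(rule, loss, theta, m):
    """Expected loss of rule when X_1..X_m are independent 0/1 with mean theta."""
    total = 0.0
    for xs in product((0, 1), repeat=m):
        ones = sum(xs)
        prob = theta**ones * (1.0 - theta) ** (m - ones)
        total += prob * loss(rule(xs), theta)
    return total


def sample_mean(xs):
    """Estimation rule: the proportion of ones observed."""
    return sum(xs) / len(xs)


def majority_rule(xs):
    """Testing rule: decide group II when more than half the observations are 1."""
    return "II" if sum(xs) > len(xs) / 2 else "I"
```

The same function `risk` serves both problems; only the set of decisions and the loss change, which is the sense in which the two classical problems are special cases of one.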

Of course, there are problems of statistical decision theory which are neither prob- 
lems of estimation nor problems of testing a hypothesis. Thus statistical decision 
theory offers simultaneously a mathematical generalization and a mathematical unifi- 
cation of classical statistical theory. 

However, it would seem that statistical decision theory as it has been described 
above (which is the usual description) is not directly applicable to problems arising in 
practice. Let us take a typical problem of estimation first. Suppose that each month 
a company sells a certain amount of its product, and that the amounts sold in the 
various months are independent, identically distributed random variables with a nor- 
mal distribution with unknown mean and variance. The problem is to estimate this 
unknown mean and variance, using the observations available. Suppose that this is 
done by the company statistician. Then what happens? It is hard to believe that a
company would retain a statistician to compute these estimates just because of idle
curiosity. Presumably, these estimates will be used to forecast future sales. Then
why not let the statistician estimate future sales directly? Even in an age of specialization,
it seems to be going too far to hire one man to construct estimates of parameters
and another man to make forecasts using these estimates. The point is that if
the statistician is told to forecast future sales, this does not necessarily have to be
done by first estimating the separate parameters. We can go farther, and ask why the
company wants to forecast future sales? Clearly, to enable it to take the proper physical
or financial action indicated by the forecast: for example, to set the production
rate at an optimal level. But then why not let the statistician go directly from the observations
to the physical action indicated by the observations, instead of breaking the
problem into what seem to be artificial pieces?

Let us apply this same discussion to the general problem of statistical decision 
theory. In statistical decision theory, the loss depends on the decision chosen and on 
the true distribution. But in many cases this means that when the loss is actually paid, 
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we will learn exactly which distribution is the true one. This is so because we know 
the decision we chose, and once we know the loss that must be paid we can solve for 
the true distribution as the one that yields the given loss in combination with the known 
decision. However, it is difficult to imagine what mechanism would or could make the 
true distribution known to us, except for certain artificial game situations. It seems, 
then, that the loss actually incurred cannot really be a function of the true distribution
of $X_1, \dots, X_m$. What, then, will this actually incurred loss depend on? Recall the
problem of forecasting future sales discussed above; in general, the actual loss will
depend on random variables $Y_1, \dots, Y_n$ which will be observed after the decision is
chosen, where the joint distribution of $X_1, \dots, X_m, Y_1, \dots, Y_n$ is not completely
known, but is known to be one of a given class of distributions. In the company's sales
forecasting problem, the X's were the sales observed before the decision was chosen,
the Y's are sales that will be observed after the decision is chosen.

We will see that making the loss depend upon random variables which will be observed
after the decision is chosen, rather than upon the distribution of the random
variables on which the decision is based, does not change the mathematical analysis 
at all, but does put the problem into the form in which it arises in practice. 

2. Notation. Now we set up a system of notation for the general statistical decision
problem. $X_1, \dots, X_m$ denote the random variables on which the decision is to be based;
the symbol X denotes the vector $(X_1, \dots, X_m)$. $Y_1, \dots, Y_n$ denote the random
variables that will be observed after the decision is chosen; the symbol Y denotes the
vector $(Y_1, \dots, Y_n)$. We assume for simplicity of exposition that X and Y are discrete
random variables. The symbol D is an index for the possible decisions: that is, a particular
value of D picks out a particular decision. The loss we incur when $X = x$, $Y = y$, and
the decision chosen is D is the function $W(y; D; x)$. In many cases, the loss does not
depend explicitly on X, and we write it $W(y; D)$. $\theta$ is an index for the possible joint
distributions of X and Y. That is, a particular value of $\theta$ picks out a particular
distribution. $f(x, y; \theta)$ denotes $P(X = x \text{ and } Y = y)$ under the $\theta$-th
distribution in our list.

A decision rule s is defined by nonnegative numbers $s(x; D)$, where $s(x; D)$ is the
probability assigned by the decision rule s to choosing the decision D when $X = x$.
Thus, when D can take on only L different values, say 1, 2, ..., L, we have

$$\sum_{D=1}^{L} s(x; D) = 1 \quad \text{for each } x.$$

For each given decision rule s, the loss that will be incurred when using s is a random
variable whose probability distribution depends on the unknown joint distribution
$f(x, y; \theta)$. The expected value of the loss that will be incurred when the decision rule
s is used and the joint distribution is the $\theta$-th in our list will be denoted by
$r(\theta; s)$. We have

$$r(\theta; s) = \sum_{x} \sum_{D=1}^{L} \sum_{y} W(y; D; x)\, f(x, y; \theta)\, s(x; D),$$

assuming a finite number L of possible decisions. If we denote
$\sum_{y} W(y; D; x)\, f(x, y; \theta)$ by $R(\theta; x; D)$, then we have

$$r(\theta; s) = \sum_{x} \sum_{D=1}^{L} R(\theta; x; D)\, s(x; D),$$

and $R(\theta; x; D)$ is the loss function of the usual formulation of statistical decision
theory. But in no practical case would $R(\theta; x; D)$ coincide with the functions usually
assumed in standard decision theory.
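
The expected-loss formula can be evaluated mechanically once W, f and s are given. The following is a minimal sketch in Python; the toy problem in it (one observed 0/1 variable, one future 0/1 variable, squared forecast error as the loss, and the names P_ONE and s_copy) is a hypothetical illustration and does not come from the text.

```python
from itertools import product

# Hypothetical toy problem (not from the text): X and Y are independent
# 0/1 variables with P(value 1) = P_ONE[theta]; the decision D is a forecast of Y.
P_ONE = {1: 0.5, 2: 0.25}
XS, YS, DECISIONS = (0, 1), (0, 1), (0, 1)

def f(x, y, theta):
    """Joint probability P(X = x and Y = y) under distribution number theta."""
    p = P_ONE[theta]
    px = p if x == 1 else 1 - p
    py = p if y == 1 else 1 - p
    return px * py

def W(y, D, x):
    """Loss W(y; D; x): squared forecast error (it happens not to depend on x)."""
    return (y - D) ** 2

def s_copy(x, D):
    """Decision rule s(x; D): forecast Y by repeating the observed value x."""
    return 1.0 if D == x else 0.0

def R(theta, x, D):
    """R(theta; x; D) = sum over y of W(y; D; x) f(x, y; theta)."""
    return sum(W(y, D, x) * f(x, y, theta) for y in YS)

def expected_loss(theta, s):
    """r(theta; s) = sum over x, D, y of W(y; D; x) f(x, y; theta) s(x; D)."""
    return sum(W(y, D, x) * f(x, y, theta) * s(x, D)
               for x, D, y in product(XS, DECISIONS, YS))

for theta in P_ONE:
    print(theta, expected_loss(theta, s_copy))
```

Summing R(theta; x; D) s(x; D) over x and D gives the same number as expected_loss, which is the identity stated in the text.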

3. The evaluation of decision rules. Roughly speaking, we consider a decision rule
s "good" when $r(\theta; s)$ is "small" for all $\theta$. To be more precise, suppose we are
considering two different decision rules, $s_1$ and $s_2$, characterized by the decision
probabilities $s_1(x; D)$ and $s_2(x; D)$ respectively. Suppose
$r(\theta; s_1) \le r(\theta; s_2)$ for all $\theta$, with $r(\theta; s_1) < r(\theta; s_2)$ for at
least one value of $\theta$. Then we say that $s_1$ is a better decision rule than $s_2$, and
we would not use $s_2$. A decision rule t is called "inadmissible" if
there is a decision rule which is better than t according to the definition just given.
Any decision rule which is not inadmissible is called "admissible." Whatever decision
rule is finally used should be an admissible decision rule, and therefore a method for
finding admissible decision rules is needed.
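
The "better than" relation is easy to check when the expected losses of a few candidate rules have already been tabulated. The sketch below is only an illustration: the risk vectors in RISKS are made-up numbers, and the helper admissible_within compares rules only inside the finite collection supplied to it, which is weaker than admissibility among all decision rules as defined above.

```python
def is_better(risk_a, risk_b):
    """True if rule a is better than rule b: r(theta; a) <= r(theta; b) for
    every theta, with strict inequality for at least one theta."""
    pairs = list(zip(risk_a, risk_b))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def admissible_within(risks):
    """Names of the rules that no other rule in this finite collection beats."""
    return [name for name, r in risks.items()
            if not any(is_better(other, r)
                       for other_name, other in risks.items()
                       if other_name != name)]

# Hypothetical expected losses (r(1; s), r(2; s)) for three candidate rules.
RISKS = {"s1": (1.0, 3.0), "s2": (2.0, 2.0), "s3": (2.0, 3.0)}
print(admissible_within(RISKS))
```

Here s3 is beaten by both s1 and s2, while s1 and s2 do not beat each other.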

4. Bayes decision rules. Suppose that there is a finite number h of possible joint
distributions of X and Y, so that we may assume that $\theta$ ranges over the values
1, ..., h. If b(1), ..., b(h) are nonnegative numbers adding to unity, then a decision
rule s is called a "Bayes decision rule relative to b(1), ..., b(h)" if

$$\sum_{\theta=1}^{h} b(\theta)\, r(\theta; s) \le \sum_{\theta=1}^{h} b(\theta)\, r(\theta; t)$$

for each and every decision rule t. A decision rule is called simply a "Bayes decision
rule" if it is a Bayes decision rule relative to some set of nonnegative numbers adding
to unity.

A basic theorem states that any admissible decision rule is a Bayes decision rule. 
However, some Bayes decision rules are inadmissible. It may then be wondered why 
we bother to pay attention to Bayes decision rules, when what we really want are only 
admissible decision rules. The answer is that it is so simple to find the Bayes deci- 
sion rules that it is a useful step in searching for the admissible decision rules. 

If s is a Bayes decision rule relative to b(1), ..., b(h), and all h of these numbers
are positive, then s must be admissible. For suppose s were not admissible. Then
there would be a decision rule t with $r(\theta; t) \le r(\theta; s)$ for all $\theta$, with
$r(\theta; t) < r(\theta; s)$ for at least one value of $\theta$, say for $\theta = j$. But then

$$\sum_{\theta=1}^{h} b(\theta)\, r(\theta; s) - \sum_{\theta=1}^{h} b(\theta)\, r(\theta; t)
= \sum_{\theta=1}^{h} b(\theta)\, [r(\theta; s) - r(\theta; t)]
\ge b(j)\, [r(j; s) - r(j; t)] > 0,$$

which contradicts the fact that s is a Bayes decision rule relative to b(1), ..., b(h),
and proves that s is admissible.

To construct a Bayes decision rule relative to given b(1), ..., b(h), we note that

$$\sum_{\theta=1}^{h} b(\theta)\, r(\theta; s)
= \sum_{x} \sum_{D=1}^{L} s(x; D) \left\{ \sum_{\theta=1}^{h} \sum_{y} b(\theta)\, W(y; D; x)\, f(x, y; \theta) \right\}.$$

s is to be chosen to minimize this expression. We denote the expression

$$\sum_{\theta=1}^{h} \sum_{y} b(\theta)\, W(y; D; x)\, f(x, y; \theta)$$

by $K(D; x)$. Then s is to be chosen to make

$$\sum_{x} \sum_{D=1}^{L} s(x; D)\, K(D; x)$$

as small as possible. Clearly, for each pair x, D, we set $s(x; D)$ equal to zero unless
$K(D; x) = \min\{K(1; x), \dots, K(L; x)\}$. If for some x, $K(D; x)$ is minimized for more
than one value of D, then there is more than one decision rule which is Bayes relative
to b(1), ..., b(h).

As an illustrative example, suppose a company is faced with the following problem. 
It can buy a certain machine from supplier A, who charges $100 and unconditionally 
guarantees that the machine will operate satisfactorily. Or it can buy a similar
machine from supplier B for $70. Supplier B does not guarantee the machine he sells, and
if it breaks down, the company will have to buy a machine from supplier A for $100. 
The machine from supplier B has been made by an unknown one of two possible fac- 
tories, one of which turns out machines 20% of which are defective, the other factory 
turning out machines 40% of which are defective. Before the company decides, it can 
observe the operation of two other installed machines known to be from the same fac- 
tory that produced the machine offered by supplier B. We turn this into a decision 
problem of the form we have been discussing by introducing the following notation. 
The random variable X_1 is defined to be equal to 0 if the first installed machine to be
observed breaks down, equal to 1 otherwise; X_2 is defined in the same way for the
second installed machine to be observed; Y is defined in the same way in terms of the
machine offered for sale by supplier B. From the conditions of the problem,
X_1, X_2, Y are all independent and identically distributed, the common distribution
being one of the two following distributions:



                     θ = 1
possible values      0      1
probability          .2     .8

                     θ = 2
possible values      0      1
probability          .4     .6

We label the decision to buy from supplier A as the first decision (D = 1), and the
decision to buy from supplier B as the second decision (D = 2). Then the loss function,
which does not depend on X_1, X_2, is given as follows:

W(y; 1) = $100 for y = 0 or 1,

W(0; 2) = $70 + $100,

W(1; 2) = $70.

For this problem, it can be verified that any decision rule s with s(0, 0; 1) = 1 and
s(1, 1; 1) = 0 is a Bayes decision rule relative to .6, .4. Any such decision rule is
admissible.
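
The verification mentioned above can be carried out by computing K(D; x) for each observation x and each decision D. The sketch below does this with exact fractions so that ties are detected reliably; the function and variable names (P_BREAK, PRIOR, bayes_decisions) are illustrative choices, and the numbers are those of the example (breakdown probabilities .2 and .4, weights .6 and .4, losses $100, $170 and $70).

```python
from fractions import Fraction as F
from itertools import product

P_BREAK = {1: F(2, 10), 2: F(4, 10)}  # P(value 0), a breakdown, under theta = 1, 2
PRIOR = {1: F(6, 10), 2: F(4, 10)}    # the weights b(1), b(2)

def prob(v, theta):
    """Probability that one machine yields value v (0 = breaks down, 1 = works)."""
    return P_BREAK[theta] if v == 0 else 1 - P_BREAK[theta]

def f(x, y, theta):
    """Joint probability of x = (x1, x2) and y for independent machines."""
    x1, x2 = x
    return prob(x1, theta) * prob(x2, theta) * prob(y, theta)

def W(y, D):
    """Loss in dollars: D = 1 buys from A; D = 2 buys from B, paying $100 more if y = 0."""
    if D == 1:
        return 100
    return 170 if y == 0 else 70

def K(D, x):
    """K(D; x) = sum over theta and y of b(theta) W(y; D) f(x, y; theta)."""
    return sum(PRIOR[t] * W(y, D) * f(x, y, t) for t in PRIOR for y in (0, 1))

def bayes_decisions(x):
    """All decisions D that minimize K(D; x); more than one entry means a tie."""
    costs = {D: K(D, x) for D in (1, 2)}
    best = min(costs.values())
    return [D for D, c in costs.items() if c == best]

for x in product((0, 1), repeat=2):
    print(x, [float(K(D, x)) for D in (1, 2)], bayes_decisions(x))
```

For x = (0, 0) only D = 1 minimizes K, for x = (1, 1) only D = 2 does, and for the two mixed observations K(1; x) and K(2; x) coincide, which is why the rule may be chosen freely there.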

5. Minimax decision rules. Since in any given problem there are usually infinitely
many admissible decision rules, what further principles can be used to pick one
particular decision rule from among all the admissible decision rules? We could claim
that it is not the statistician's job to pick one particular decision rule, but to find all
the admissible decision rules. Then the person who will actually incur the loss should 
pick one particular decision rule from among the admissible decision rules presented 
to him by the statistician. However, some general principles for choosing one parti- 
cular decision rule have been suggested, though none has been universally adopted. 
The most familiar such principle is the minimax principle, which we now describe. 

For any decision rule s, denote $\max_{\theta} r(\theta; s)$ by M(s). Then the minimax
principle states: choose an admissible decision rule s which minimizes M(s). In other
words, use a decision rule which has the smallest maximum expected loss. This principle
has been criticized for being too conservative.

What computational techniques can be used to actually find a minimax decision rule?
Let V denote $\min_{s} M(s)$, a quantity which will be unknown at the start of our
computations. Then we have to find the quantities $s(x; D)$ so that $r(\theta; s) \le V$ for
$\theta = 1, \dots, h$ and so that V is minimized. This is a problem in which our unknowns
are $s(x; D)$ and V. Since the $r(\theta; s)$ are linear functions of the $s(x; D)$, we have a
typical linear programming problem, to which the simplex method, for example, can be
applied. Of course this is so only when we are dealing with a finite number of possible
decisions and distributions and our random variables are discrete. However, in many cases
infinite situations can be satisfactorily approximated by finite situations.
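
Written out, the linear program just described takes the following form. This is a sketch under the finiteness assumptions stated above; the last two constraints are simply the conditions that make the numbers s(x; D) a decision rule, and R(θ; x; D) stands for the sum over y of W(y; D; x) f(x, y; θ).

```latex
\begin{aligned}
\text{minimize}\quad & V \\
\text{subject to}\quad
  & \sum_{x}\sum_{D=1}^{L} R(\theta; x; D)\, s(x; D) \le V,
      && \theta = 1, \dots, h, \\
  & \sum_{D=1}^{L} s(x; D) = 1
      && \text{for each } x, \\
  & s(x; D) \ge 0
      && \text{for each } x \text{ and each } D.
\end{aligned}
```

The objective and every constraint are linear in the unknowns V and s(x; D). If the program has more than one optimal solution, the requirement that the rule finally chosen be admissible still has to be checked separately.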

6. Problems involving a sequence of decisions over time. We denote by Y(j) the
vector of random variables that will be observed between the j-th time at which we must
choose a decision and the (j + 1)-th time at which we must choose a decision. X denotes,
as usual, the vector that will be observed before any decision must be chosen. D(j)
denotes the decision made at the j-th time. Suppose that a decision must be made at T
different times. Then our loss will be denoted by
W(X, D(1), Y(1), D(2), Y(2), ..., D(T), Y(T)).

The most important fact about the construction of Bayes decision rules in this case 
is that we must first describe how the decision rule chooses D(T), and then we des- 
cribe how the decision rule chooses D(T - 1), and then how the decision rule chooses 
D(T - 2), etc. In other words, we must work our way backwards in the construction of
Bayes decision rules. For in order to evaluate the goodness of the decision to be made
at any given time, we must know how we will proceed in the future (that is, how we 
will make future decisions). 

In choosing decision D(j), we must of course take into account the already known
values of X, D(1), Y(1), D(2), Y(2), ..., D(j-1), Y(j-1). Thus for the problem of choosing
D(j) the quantities X, D(1), ..., Y(j-1) play the same role as the quantities
$X_1, \dots, X_m$ did in the simpler problems where there was only one time when a decision
had to be made. Furthermore, when we have to choose D(j), we assume that we have already
described how we will choose D(j+1), ..., D(T). This means that D(j+1), ..., D(T)
are expressed in terms of X, D(1), Y(1), ..., Y(j-1), D(j), Y(j), Y(j+1), ..., Y(T-1).
Thus, for the problem of choosing D(j), we have eliminated the variables D(j+1), ...,
D(T) by expressing them in terms of the other variables. But then the problem of
choosing D(j) has been turned into a problem of the type discussed above, with the
following correspondences. In our present problem, X, D(1), ..., Y(j-1) play the role
of $X_1, \dots, X_m$ in the earlier problem; and in our present problem Y(j), Y(j+1), ...,
Y(T) play the role of $Y_1, \dots, Y_n$ in our original problem. Then the construction of a
Bayes decision rule proceeds exactly as before. Of course, the construction of an
overall Bayes decision rule requires T separate applications of the procedure: one for
describing how D(T) is to be chosen, then one for describing how D(T-1) is to be
chosen, ..., finally one for describing how D(1) is to be chosen.
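
The backward construction can be sketched as a short recursion. Everything in the example below is a hypothetical toy problem chosen for illustration, not the numerical example of the text: there is no initial observation X, two decision times, two possible distributions with equal weights, an overhaul that costs 220 and halves the breakdown probability of the next period, and a breakdown cost of 1000. The function value plays the part of K(D; x), with the history observed so far in the role of x.

```python
from functools import lru_cache

T = 2                        # number of decision times (assumed for the toy problem)
DECISIONS = (0, 1)           # hypothetical: 0 = no overhaul, 1 = overhaul
OUTCOMES = (0, 1)            # 1 = breakdown in the period after the decision
PRIOR = {1: 0.5, 2: 0.5}     # weights b(theta)
P_BREAK = {1: 0.1, 2: 0.7}   # breakdown probability without overhaul, under theta

def likelihood(history, theta):
    """Probability under theta of the outcomes in history, given the decisions."""
    prob = 1.0
    for d, y in history:
        p = P_BREAK[theta] / (1 + d)   # an overhaul halves the breakdown chance
        prob *= p if y == 1 else 1 - p
    return prob

def loss(history):
    """W(D(1), Y(1), ..., D(T), Y(T)): overhaul costs plus breakdown costs."""
    return sum(220 * d + 1000 * y for d, y in history)

@lru_cache(maxsize=None)
def value(history):
    """Smallest weighted expected loss attainable after this history, computed
    by assuming that all later decisions are chosen optimally."""
    if len(history) == T:
        weight = sum(b * likelihood(history, th) for th, b in PRIOR.items())
        return weight * loss(history)
    return min(sum(value(history + ((d, y),)) for y in OUTCOMES)
               for d in DECISIONS)

def best_decision(history):
    """Decision at the next time that attains the minimum in value(history)."""
    return min(DECISIONS,
               key=lambda d: sum(value(history + ((d, y),)) for y in OUTCOMES))

print(best_decision(()), best_decision(((0, 0),)), best_decision(((0, 1),)))
```

The recursion evaluates the last decision first, because value at a shorter history is defined through value at longer ones; the choice of D(1) is therefore made with the rule for D(2) already fixed, as the text requires.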

Now we describe a numerical example. The period of use of a machine is divided 
into three time periods: Period 1 is from installation to first overhaul; period 2 is 
from first overhaul to second overhaul; period 3 is from second overhaul to the re- 
placement of the machine. D(l) is the amount spent on the first overhaul, D(2) is the 
amount spent on the second overhaul. X is defined to be 1 if the machine breaks down
during period 1, and 0 if the machine does not break down during period 1. Y(1) and
Y(2) are defined in the same way for periods 2 and 3 respectively. X, Y(1) and Y(2)
are assumed to be independent random variables, with P(X = 1) = θ, P(Y(1) = 1) =
θ/(1 + D(1)), P(Y(2) = 1) = θ/(1 + | D(1) + D(2)). θ is an unknown number between 0 and
1. If the machine breaks down in any period, it costs $1000 to put it back in use. Thus

+ 1000(X 
ON CHANNELS IN WHICH THE DISTRIBUTION OF ERROR
IS KNOWN ONLY TO THE RECEIVER OR ONLY TO THE SENDER 
J. Wolfowitz 

In this note we illustrate, by means of a set of simultaneous semi-continuous channels,
some of the results described and proved in [1] (especially Section 8). In Section 7
of this latter paper extension of these results to many other channels, including
some with memory, is indicated. The present note, although intentionally brief, requires
for its comprehension no prior knowledge of information theory. For this reason it is
necessary to begin with a number of definitions. The meaning of any non-mathematical
terms which occur in these will either be precisely explained at once or
will soon become readily apparent.

Let the input alphabet consist of the (real) numbers $a_1, \dots, a_k$. Define a u-sequence
as any sequence of n numbers, each of which is one of $a_1, \dots, a_k$. By a v-sequence we
shall mean any sequence of n real numbers. The sender sends (or transmits) u-sequences
through the channel. Any u-sequence transmitted may be garbled by channel noise
(error). Let $u_0 = (x_1, \dots, x_n)$ be any u-sequence sent. The chance received v-sequence
$v(u_0)$ is a sequence of independent chance variables,

$$v(u_0) = (Y_1(u_0), \dots, Y_n(u_0)),$$

where $Y_i(u_0) - x_i$ has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Let A
be any set of v-sequences. Then

$$P\{v(u_0) \in A \mid \mu, \sigma^2\},$$

the probability that $v(u_0)$ lies in A, is of course a function of $\mu$ and $\sigma^2$.

Let $J_1$ and $J_2$ be bounded subsets of the real line; $J_2$ is to contain non-negative
numbers only. The parameters $\mu$ and $\sigma^2$ lie, respectively, in $J_1$ and $J_2$, and may
vary arbitrarily from one transmitted u-sequence to another. The case where
$(\mu, \sigma^2)$ is known to both sender and receiver falls within the results of [2]. The case
where $(\mu, \sigma^2)$ is known to neither sender nor receiver is treated in [1]. Here we shall
discuss the two cases where I) the receiver knows $(\mu, \sigma^2)$ but the sender does not, and
II) the sender knows $(\mu, \sigma^2)$ but the receiver does not. We shall call these,
respectively, channel I and channel II. As we have already remarked, the results to be
stated are implicit in [1].

* ... unnecessary condition that $J_2$ should be at a positive dis...

A code $(N, \lambda)$ for channel I is a system of sets

$$C(\mu, \sigma^2) = \{(u_1, A_1(\mu, \sigma^2)), \dots, (u_N, A_N(\mu, \sigma^2))\}$$

for every $\mu$ in $J_1$ and every $\sigma^2$ in $J_2$, with the following properties:

a) $u_1, \dots, u_N$ are u-sequences.

b) $A_1(\mu, \sigma^2), \dots, A_N(\mu, \sigma^2)$ are sets of v-sequences.

c) For every $(\mu, \sigma^2)$ the sets $A_1(\mu, \sigma^2), \dots, A_N(\mu, \sigma^2)$ are disjoint.

d) For every $(\mu, \sigma^2)$ we have

$$P\{v(u_i) \in A_i(\mu, \sigma^2) \mid \mu, \sigma^2\} \ge 1 - \lambda, \quad i = 1, \dots, N.$$

The practical use of such a code is as follows: When the sender wants to send the i-th
word of a dictionary of N words, he sends the u-sequence $u_i$. When the receiver knows
that $\mu$, $\sigma^2$ are, respectively, the mean and variance of the error for a particular
u-sequence, and the v-sequence received lies in $A_j(\mu, \sigma^2)$, the receiver concludes that
the u-sequence $u_j$ has been sent. The probability that any word (u-sequence) sent will
be correctly "decoded" (understood) by the receiver is $\ge 1 - \lambda$.

A code $(N, \lambda)$ for channel II is a system of sets

$$C(\mu, \sigma^2) = \{(u_1(\mu, \sigma^2), A_1), \dots, (u_N(\mu, \sigma^2), A_N)\}$$

for every $\mu$ in $J_1$ and every $\sigma^2$ in $J_2$, with the following properties:

a) $u_1(\mu, \sigma^2), \dots, u_N(\mu, \sigma^2)$ are u-sequences.

b) $A_1, \dots, A_N$ are disjoint sets of v-sequences.

c) For every $(\mu, \sigma^2)$ we have

$$P\{v(u_i(\mu, \sigma^2)) \in A_i \mid \mu, \sigma^2\} \ge 1 - \lambda, \quad i = 1, \dots, N.$$

The practical use of such a code is as follows: When the sender wants to send the i-th
word of a dictionary of N words and knows that $\mu$, $\sigma^2$ are, respectively, the mean and
variance of the error for this particular u-sequence, he sends $u_i(\mu, \sigma^2)$. When the
receiver receives a v-sequence in $A_j$ he concludes that the j-th word has been sent. The
probability that any word sent will be correctly decoded by the receiver is $\ge 1 - \lambda$.

The capacity of a channel is a concept fundamental in information theory. As its
colloquial name implies, it is a number which measures a certain capability of the
channel. For channels I and II the capacities will be implicitly but precisely defined by
Theorems 1 and 2 below. These theorems hold for channels I and II and are to be
understood as follows: The number C which occurs in both is to be replaced by $C_1$, the
capacity of channel I, when the theorems are applied to channel I, and is to be replaced
by $C_2$, the capacity of channel II, when the theorems are applied to channel II.

Theorem 1. Let $\varepsilon > 0$ and $\lambda$, $0 < \lambda < 1$, be arbitrary. For n sufficiently
large there exists a code $(2^{n(C - \varepsilon)}, \lambda)$.

Theorem 2. Let $\varepsilon > 0$ and $\lambda$, $0 < \lambda < 1$, be arbitrary. For n sufficiently
large there does not exist a code $(2^{n(C + \varepsilon)}, \lambda)$.

The capacities are related as follows:

$$0 < C_1 \le C_2.$$

In general, $C_1 < C_2$. Let $C(\mu, \sigma^2)$ be the capacity defined by Shannon ([3]; see also,
for example, [2]) of the channel where $\mu$, $\sigma^2$ are the mean and variance, respectively,
of the error for every u-sequence transmitted (and this fact is, of course, known
to both sender and receiver). Then obviously

$$C_2 \le \inf C(\mu, \sigma^2),$$

where the infimum is taken over all $\mu$ in $J_1$ and all $\sigma^2$ in $J_2$. A general theorem
of [1] implies that actually

$$C_2 = \inf_{\mu, \sigma^2} C(\mu, \sigma^2).$$

The value of $C_1$, also implied by a general theorem of [1], is less simple to describe.
It will be briefly described in the next paragraph in a manner intelligible only
to one with some familiarity with information theory. Readers unfamiliar with
information theory are invited to omit the next paragraph and are referred to [1] for a
precise, and to them intelligible, description of $C_1$.

Let $\pi$ be any stochastic input on the input alphabet $a_1, \dots, a_k$. For given $\pi$ and
$(\mu, \sigma^2)$ let $H(\pi, \mu, \sigma^2)$ be the usual difference between the Shannon entropy of the
output and the conditional entropy of the output, given the input. Then

$$C_1 = \sup_{\pi} \inf_{\mu, \sigma^2} H(\pi, \mu, \sigma^2).$$

In this notation

$$C_2 = \inf_{\mu, \sigma^2} \sup_{\pi} H(\pi, \mu, \sigma^2).$$

This makes it clear at once that $C_1 \le C_2$.
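
The step from the two formulas to the inequality between the capacities is the familiar comparison of a sup-inf with an inf-sup. A short derivation is sketched below; the symbols π₀ and (μ₀, σ₀²) are auxiliary names introduced only for this argument, and π denotes the stochastic input.

```latex
% Fix any stochastic input \pi_0 and any admissible pair (\mu_0, \sigma_0^2). Then
\inf_{\mu, \sigma^2} H(\pi_0, \mu, \sigma^2)
  \;\le\; H(\pi_0, \mu_0, \sigma_0^2)
  \;\le\; \sup_{\pi} H(\pi, \mu_0, \sigma_0^2).
% The left side does not depend on (\mu_0, \sigma_0^2) and the right side does not
% depend on \pi_0, so taking the supremum over \pi_0 on the left and the infimum
% over (\mu_0, \sigma_0^2) on the right preserves the inequality:
C_1 = \sup_{\pi} \inf_{\mu, \sigma^2} H(\pi, \mu, \sigma^2)
  \;\le\; \inf_{\mu, \sigma^2} \sup_{\pi} H(\pi, \mu, \sigma^2) = C_2 .
```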

A general theorem of [1] implies the following: $C_1$ is the capacity of the channel
where neither sender nor receiver knows $(\mu, \sigma^2)$, which varies arbitrarily from one
u-sequence to another, $\mu$ in $J_1$ and $\sigma^2$ in $J_2$.

The channels I and II were selected for the illustrative purpose of this note, from
among the channels which come within the scope of the results of [1], for their
importance and simplicity. It is obvious that if either sender or receiver knows $\mu$ he can
compensate for $\mu$ by subtracting it from each letter. Moreover, this can clearly be
done whether or not $\mu$ lies in $J_1$. It follows that $C_1$ and $C_2$ do not depend on $J_1$.
(It is trivial to verify that neither $C(\mu, \sigma^2)$ nor $H(\pi, \mu, \sigma^2)$ depends on $\mu$.)
Why then was $J_1$ (or indeed $\mu$) introduced at all? The answer is that it was introduced
in order to utilize the results of [1] so as to be able to make the statement of the
preceding paragraph. Not only is this latter result interesting per se, but taken in
conjunction with the results for channel I it shows that knowledge of the distribution of
error for any word (u-sequence) by the receiver alone does not increase the capacity. It
is also interesting that, even when neither sender nor receiver knows the distribution of
error for any word, the capacity does not depend on $J_1$.

To sum up: As long as $\mu$ and $\sigma^2$ are restricted to $J_1$ and $J_2$, respectively, and the
sender does not know $(\mu, \sigma^2)$, the capacity of the channel is the same whether or not
the receiver knows $(\mu, \sigma^2)$, and does not depend on $J_1$. If $\sigma^2$ is restricted to $J_2$
and is unknown to the sender but known to the receiver, and if $\mu$ is unrestricted on the
real line but known to the receiver, the capacity of the channel is $C_1$. Finally we note
that obviously

$$C_2 = C(0, t),$$

where t is the supremum of the points of $J_2$.

Theorem 1 is an example of a coding theorem in information theory, and Theorem 2
is the "strong" converse of Theorem 1. In the "weak" converse the conclusion is the
same but the hypothesis is strengthened to require that $\lambda$ be sufficiently small. For an
explanation of why the strong converse is a much deeper result than the weak converse
see [1] or [2], and for an example of how they are sometimes confused in the
literature see [2].

Results analogous to all those above hold if one of $\mu$, $\sigma^2$ is fixed and only the other
is allowed to vary. Of course, as we have seen above, the parameter $\mu$ has a special
position.